Contrastive Out-of-Distribution Detection for Pretrained Transformers

Pretrained Transformers achieve remarkable performance when training and test data are from the same distribution. However, in real-world scenarios, the model often faces out-of-distribution (OOD) instances that can cause severe semantic shift problems at inference time. Therefore, in practice, a reliable model should identify such instances and then either reject them during inference or pass them over to models that handle another distribution. In this paper, we develop an unsupervised OOD detection method in which only the in-distribution (ID) data are used in training. We propose to fine-tune Transformers with a contrastive loss, which improves the compactness of representations, such that OOD instances can be better differentiated from ID ones. These OOD instances can then be accurately detected using the Mahalanobis distance in the model's penultimate layer. We experiment with comprehensive settings and achieve near-perfect OOD detection performance, drastically outperforming baselines. We further investigate the rationales behind the improvement, finding that the more compact representations produced by margin-based contrastive learning drive the gains. We release our code to the community for future research.


Introduction
Many natural language classifiers are developed based on a closed-world assumption, i.e., that the training and test data are sampled from the same distribution. However, training data can rarely capture the entire distribution. In real-world scenarios, out-of-distribution (OOD) instances, which come from categories that are not known to the model, are often present at inference time. These instances may be misclassified by the model into known categories with high confidence, causing the semantic shift problem (Hsu et al., 2020). As a practical solution to this problem in real-world applications, the model should detect such instances and signal exceptions or transmit them to models handling other categories or tasks. Although pretrained Transformers (Devlin et al., 2019) achieve remarkable results when intrinsically evaluated on in-distribution (ID) data, recent work (Hendrycks et al., 2020) shows that many of these models fall short of detecting OOD instances.
Despite its importance, few attempts have been made to address OOD detection in NLP tasks. One proposed method is to train a model on both ID and OOD data and regularize the model to produce lower confidence on OOD instances than on ID ones (Hendrycks et al., 2018; Larson et al., 2019). However, as OOD instances reside in an unbounded feature space, their distribution at inference time is usually unknown. Hence, it is hard to decide which OOD instances to use in training, let alone that such instances may be unavailable in many scenarios. Another common method for OOD detection is to use the maximum class probability as an indicator (Shu et al., 2017; Hendrycks et al., 2020), such that lower values indicate more probable OOD instances. Though easy to implement, its OOD detection performance is far from perfect, as prior studies (Dhamija et al., 2018; Liang et al., 2018) show that OOD inputs can often receive high probabilities as well.
In this paper, we aim to improve the OOD detection ability of natural language classifiers, in particular pretrained Transformers, which have been the backbones of many state-of-the-art NLP systems. For practical purposes, we adopt the setting where only ID data are available during task-specific training. Moreover, we require that the model maintain classification performance on the ID task data. To this end, we propose a contrastive learning framework for unsupervised OOD detection, which is composed of a contrastive loss and an OOD scoring function. Our contrastive loss aims at increasing the discrepancy between the representations of instances from different classes in the task. During training, instances belonging to the same class are regarded as pseudo-ID data, while those of different classes are considered mutually pseudo-OOD data. We hypothesize that increasing inter-class discrepancies can help the model learn discriminative features for ID/OOD distinctions, and therefore help detect true OOD data at inference. We study two versions of the contrastive loss: a similarity-based contrastive loss (Sohn, 2016; Oord et al., 2018) and a margin-based contrastive loss. The OOD scoring function maps the representations of instances to OOD detection scores, indicating the likelihood of an instance being OOD. We examine different combinations of contrastive losses and OOD scoring functions, including maximum softmax probability, energy score, Mahalanobis distance, and maximum cosine similarity. In particular, we observe that OOD scoring based on the Mahalanobis distance (Lee et al., 2018b), when incorporated with the margin-based contrastive loss, generally leads to the best OOD detection performance. The Mahalanobis distance is computed from the penultimate layer of the Transformer by fitting a class-conditional multivariate Gaussian distribution.
The main contributions of this work are threefold. First, we propose a contrastive learning framework for unsupervised OOD detection, where we comprehensively study combinations of different contrastive learning losses and OOD scoring functions. Second, extensive experiments on various tasks and datasets demonstrate the significant improvement our method has made to OOD detection for Transformers. Third, we provide a detailed analysis to reveal the importance of different incorporated techniques, which also identifies further challenges for this emerging research topic.

Related Work
Out-of-Distribution Detection.
Determining whether an instance is OOD is critical for the safe deployment of machine learning systems in the real world (Amodei et al., 2016). The main challenge is that the distribution of OOD data is hard to estimate a priori. Based on the availability of OOD data, recent methods can be categorized into supervised, self-supervised, and unsupervised ones.
Supervised methods train models on both ID and OOD data, where the models are expected to output a uniform distribution over known classes on OOD data (Lee et al., 2018a; Dhamija et al., 2018; Hendrycks et al., 2018). However, it is hard to assume the presence of a large dataset that provides comprehensive coverage of OOD instances in practice. Self-supervised methods (Bergman and Hoshen, 2020) apply augmentation techniques to change certain properties of the data (e.g., through rotation of an image) and simultaneously learn an auxiliary model to predict the property changes (e.g., the rotation angle). Such an auxiliary model is expected to generalize worse on OOD data, which can in turn be identified by a larger loss. However, it is hard to define such transformations for natural language. Unsupervised methods use only ID data in training. They detect OOD data based on the class probabilities (Bendale and Boult, 2016; Hendrycks and Gimpel, 2017; Shu et al., 2017; Liang et al., 2018) or other latent space metrics (Lee et al., 2018b). In particular, Vyas et al. (2018) randomly split the training classes into two subsets and treat them as pseudo-ID and pseudo-OOD data, respectively. They then train an OOD detector that requires the entropy of the probability distribution on pseudo-OOD data to be higher than on pseudo-ID data. This process is repeated to obtain multiple OOD detectors, and their ensemble is used to detect the OOD instances. This method conducts OOD detection at the cost of high computational overhead from training redundant models and has the limitation of not supporting detection for binary classification tasks.
Though extensively studied in computer vision (CV), OOD detection has been overlooked in NLP, and most prior works (Kim and Kim, 2018; Hendrycks et al., 2018; Tan et al., 2019) require both ID and OOD data in training. Hendrycks et al. (2020) use the maximum softmax probability as the detection score and show that pretrained Transformers exhibit better OOD detection performance than models such as LSTMs (Hochreiter and Schmidhuber, 1997), though the performance is still imperfect. Our framework, as an unsupervised OOD detection approach, significantly improves the OOD detection of Transformers using only ID data.
Contrastive Learning. Recently, contrastive learning has received a lot of research attention. It works by mapping instances of the same class into a nearby region and making instances of different classes uniformly distributed. Many efforts in CV (Misra and Maaten, 2020; He et al., 2020) and NLP (Giorgi et al., 2021) incorporate contrastive learning into self-supervised learning, which seeks to gather the representations of different augmented views of the same instance and separate those of different instances. Prior work on image classification (Tack et al., 2020; Winkens et al., 2020) shows that models trained with self-supervised contrastive learning generate discriminative features for detecting distributional shifts. However, such methods rely heavily on data augmentation of instances and are hard to apply to NLP. Other efforts in CV (Khosla et al., 2020) and NLP (Gunel et al., 2021) conduct contrastive learning in a supervised manner, which aims at embedding instances of the same class closer together and separating different classes. They show that models trained with supervised contrastive learning exhibit better classification performance. To the best of our knowledge, we are the first to introduce supervised contrastive learning to OOD detection. Such a method does not rely on data augmentation and thus can be easily adapted to existing NLP models. We also propose a margin-based contrastive objective that greatly outperforms standard supervised contrastive losses.

Method
In this section, we first formally define the OOD detection problem (Sec. 3.1), then introduce the overall framework (Sec. 3.2), and finally present the contrastive representation learning and scoring functions (Sec. 3.3 and Sec. 3.4).

Problem Definition
We aim at improving the OOD detection performance of natural language classifiers that are based on pretrained Transformers, using only ID data in the main-task training. Generally, out-of-distribution (OOD) instances can be defined as instances (x, y) sampled from an underlying distribution other than the training distribution P(X_train, Y_train), where X_train and Y_train are the training corpus and training label set, respectively. In this context, the literature further divides OOD data into those with semantic shift or non-semantic shift (Hsu et al., 2020). Semantic shift refers to instances that do not belong to Y_train. More specifically, instances with semantic shift may come from unknown categories or irrelevant tasks. Therefore, the model is expected to detect and reject such instances (or forward them to models handling other tasks), instead of mistakenly classifying them into Y_train. Non-semantic shift, on the other hand, refers to instances that belong to Y_train but are sampled from a distribution other than X_train, e.g., a different corpus. Though drawn from OOD, those instances can be classified into Y_train and thus can be accepted by the model. Hence, in the context of this paper, we primarily consider an instance (x, y) to be OOD if y ∉ Y_train, i.e., exhibiting semantic shift, to be consistent with the problem settings of prior studies (Hendrycks and Gimpel, 2017; Lee et al., 2018b; Hendrycks et al., 2020).
We hereby formally define the OOD detection task. Specifically, given a main task of natural language classification (e.g., sentence classification, NLI, etc.), for an instance x to be classified, our goal is to develop an auxiliary OOD scoring function f(x) : X → R. This function should return a low score for an ID instance where y ∈ Y_train, and a high score for an OOD instance where y ∉ Y_train (y is the underlying label of x and is unknown at inference). During inference, we can set a threshold on the OOD score to filter out most OOD instances. This process involves a trade-off between false negatives and false positives and may be specific to the application. Meanwhile, we expect that the auxiliary OOD detector should not negatively affect the performance of the main task on ID data.
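As a toy illustration of this thresholding step, the sketch below wraps an arbitrary scoring function f into a binary detector. The helper name and threshold value are illustrative, not part of our method:

```python
def make_ood_detector(score_fn, threshold):
    """Turn an OOD scoring function f(x) -> R into a binary detector.

    Instances whose score exceeds the threshold are flagged as OOD;
    moving the threshold trades false positives against false negatives.
    """
    def is_ood(x):
        return score_fn(x) > threshold
    return is_ood

# Example with a dummy scoring function that just returns its input.
detector = make_ood_detector(lambda x: x, threshold=0.5)
```

In practice the threshold would be chosen on ID validation scores, e.g., to keep a target true positive rate.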

Framework Overview
Next, we introduce the formulation of our contrastive learning framework for OOD detection. We decompose OOD detection into two steps. The first step is contrastive representation learning, where we focus on learning a representation space H in which the distributions of ID and OOD data are distinct. Accordingly, we need another function to map the representation to an OOD score. This process is equivalent to expressing OOD detection as f(x) = g(h), where h ∈ H is the dense representation of the input text x given by an encoder, and g : H → R is a scoring function mapping the representation to an OOD detection score. Using this decomposition, we can use different training strategies for h and different functions for g, which are studied in the following sections.
The learning process of our framework is described in Alg. 1. In the training phase, our framework takes training and validation datasets that are both ID as input. The model is optimized with both the (main-task) classification loss and the contrastive loss on batches sampled from the ID training data. The best model is selected based on the ID validation data. Specifically, for a distribution-based OOD scoring function such as the Mahalanobis distance, we first need to fit the OOD detector on the ID validation data. We then evaluate the trained model on the ID validation data, where a satisfactory model should have a low contrastive loss and preserve the classification performance. In the end, our framework returns a classifier to handle the main task on ID data and an OOD detector to identify OOD instances at inference.

Contrastive Representation Learning
In this section, we discuss how to learn distinctive representations for OOD detection. For better OOD detection performance, the representation space H is supposed to minimize the overlap between the representations of ID and OOD data. In a supervised setting where both ID and OOD data are available in training, it would be easy to obtain such an H. For example, Dhamija et al. (2018) train a neural model on both ID and OOD data and require the magnitude of the representations of OOD instances to be smaller than that of ID representations. However, in real-world applications, the distribution of OOD data is usually unknown beforehand. We thus tackle a more general problem setting where the OOD data are assumed unavailable in training (unsupervised OOD detection, introduced below).
In this unsupervised setup, though all training data used are ID, they may belong to different classes. We leverage data of distinct classes to learn more discriminative features. Through a contrastive learning objective, instances of the same

Figure 1: Illustration of our proposed contrastive loss. The contrastive loss seeks to increase the discrepancy of the representations for instances from different training classes, such that OOD instances from unknown classes can be better differentiated.
class form compact clusters, while instances of different classes are encouraged to stay apart from each other beyond a certain margin, as illustrated in Fig. 1. The discriminative feature space generalizes to OOD data, which ultimately leads to better OOD detection performance at inference when encountering an unknown distribution. We realize such a strategy using two alternative contrastive losses, i.e., the supervised contrastive loss and the margin-based contrastive loss.
Supervised Contrastive Loss. Different from the contrastive loss used in self-supervised representation learning (He et al., 2020), which compares augmented instances to other instances, our contrastive loss contrasts instances with those from different ID classes. To give a more specific illustration of our technique, we first consider the supervised contrastive loss (Khosla et al., 2020; Gunel et al., 2021). Specifically, for a multi-class classification problem with C classes, given a batch of training instances {(x_i, y_i)}_{i=1}^{M}, where x_i is the input text and y_i is the ground-truth label, the supervised contrastive loss can be formulated as:

$$\mathcal{L}_{scl} = \sum_{i=1}^{M} \frac{-1}{M\,|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},$$

where A(i) = {1, ..., M}\{i} is the set of all anchor instances, P(i) = {p ∈ A(i) : y_p = y_i} is the set of anchor instances from the same class as i, τ is a temperature hyper-parameter, and z is the L2-normalized [CLS] embedding before the softmax layer (Khosla et al., 2020; Gunel et al., 2021). The L2 normalization avoids huge values in the dot product, which may lead to unstable updates. This loss is optimized to increase the cosine similarity of instance pairs from the same class and decrease it otherwise.
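For concreteness, here is a minimal pure-Python sketch of a supervised contrastive loss over a small batch of embeddings. Function names are illustrative, and a real implementation would operate on GPU tensors rather than Python lists:

```python
import math

def l2_normalize(v):
    """L2-normalize a vector (lists stand in for [CLS] embeddings)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def supervised_contrastive_loss(embeddings, labels, tau=0.3):
    """Supervised contrastive loss over a batch.

    Each instance i is contrasted against all other instances A(i);
    pairs sharing a label, P(i), are pulled together and the rest are
    pushed apart, with temperature tau scaling the similarities.
    """
    z = [l2_normalize(e) for e in embeddings]
    M = len(z)
    total, counted = 0.0, 0
    for i in range(M):
        anchors = [a for a in range(M) if a != i]                    # A(i)
        positives = [p for p in anchors if labels[p] == labels[i]]   # P(i)
        if not positives:
            continue  # no same-class pair for this instance
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau
                for a in anchors}
        denom = sum(math.exp(s) for s in sims.values())
        loss_i = -sum(math.log(math.exp(sims[p]) / denom)
                      for p in positives) / len(positives)
        total += loss_i
        counted += 1
    return total / max(counted, 1)
```

A batch whose same-class embeddings already point in similar directions yields a lower loss than one whose classes are intermixed.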
Margin-based Contrastive Loss. The supervised contrastive loss produces minimal gradients once the similarity difference between positive and negative instances exceeds a certain point. However, to better separate OOD instances, it is beneficial to enlarge the discrepancy between classes as much as possible. Therefore, we propose another, margin-based contrastive loss. It encourages the L2 distances of instances from the same class to be as small as possible, forming compact clusters, and the L2 distances of instances from different classes to be larger than a margin. Our loss is formulated as:

$$\mathcal{L}_{margin} = \frac{1}{dM} \sum_{i=1}^{M} \Big( \frac{1}{|P(i)|} \sum_{p \in P(i)} \lVert h_i - h_p \rVert^2 + \frac{1}{|N(i)|} \sum_{n \in N(i)} \max\big(0,\ \xi - \lVert h_i - h_n \rVert^2\big) \Big).$$

Here N(i) = {n ∈ A(i) : y_n ≠ y_i} is the set of anchor instances from classes other than y_i, h ∈ R^d is the unnormalized [CLS] embedding before the softmax layer, ξ is a margin, and d is the number of dimensions of h. As we do not use OOD data in training, it is hard to properly tune the margin. Hence, we further incorporate an adaptive margin. Intuitively, distances between instances from the same class should be smaller than those between instances from different classes. Therefore, we define the margin as the maximum distance between pairs of instances from the same class in the batch:

$$\xi = \max_{i} \max_{p \in P(i)} \lVert h_i - h_p \rVert^2.$$

We evaluate both contrastive losses in experiments.
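The margin-based loss with the adaptive margin can be sketched in a few lines of pure Python. This is an illustrative reading of the loss description above (the exact per-anchor normalization is an assumption), not a drop-in training implementation:

```python
def sq_l2(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def margin_contrastive_loss(embeddings, labels):
    """Margin-based contrastive loss with an adaptive margin xi.

    Same-class pairs are pulled together (squared L2 distance minimized);
    different-class pairs are pushed beyond xi, which is set adaptively
    to the largest same-class pair distance in the batch.
    """
    M, d = len(embeddings), len(embeddings[0])
    # Adaptive margin: maximum same-class pair distance in the batch.
    xi = max(sq_l2(embeddings[i], embeddings[j])
             for i in range(M) for j in range(M)
             if i != j and labels[i] == labels[j])
    total = 0.0
    for i in range(M):
        pos = [j for j in range(M) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(M) if j != i and labels[j] != labels[i]]
        if pos:
            total += sum(sq_l2(embeddings[i], embeddings[j])
                         for j in pos) / len(pos)
        if neg:
            total += sum(max(0.0, xi - sq_l2(embeddings[i], embeddings[j]))
                         for j in neg) / len(neg)
    # Normalize by embedding dimension d and batch size M.
    return total / (d * M)
```

Well-separated, compact clusters give a small margin and a near-zero hinge term, so the loss drops as the classes move apart.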
In training, the model is jointly optimized with the cross-entropy classification loss L_ce and the contrastive loss L_cont:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{cont},$$

where λ is a positive coefficient. We tune λ based on the contrastive loss and the classification performance on the ID validation set, where a selected value for λ should achieve a smaller contrastive loss while maintaining the classification performance.

OOD Scoring Functions
Next, we introduce the modeling of the OOD scoring function g. The goal of the scoring function g is to map the representations of instances to OOD detection scores, where higher scores indicate higher likelihoods for being OOD. In the following, we describe several choices of this scoring function.
Maximum Softmax Probability (MSP). Hendrycks and Gimpel (2017) use the maximum class probability among the C training classes in the softmax layer as an OOD indicator, i.e., $g(x) = 1 - \max_{j=1}^{C} p_j$, so that a lower maximum probability yields a higher OOD score. This method has been widely adopted as a baseline for OOD detection (Hendrycks and Gimpel, 2017; Hsu et al., 2020; Bergman and Hoshen, 2020; Hendrycks et al., 2020).
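A minimal sketch of the MSP score from raw class logits (a numerically stable softmax is assumed; the function name is illustrative):

```python
import math

def msp_ood_score(logits):
    """OOD score 1 - max_j softmax(logits)_j; higher = more likely OOD."""
    m = max(logits)                               # shift for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return 1.0 - max(probs)
```

A confident prediction (one dominant logit) gives a score near 0, while a flat distribution over C classes gives 1 - 1/C.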
Energy Score (Energy). Liu et al. (2020) interpret the softmax function as the ratio of the joint probability in X × Y to the probability in X, and estimate the probability density of inputs as:

$$g(x) = -\log \sum_{j=1}^{C} \exp(w_j^{\top} h),$$

where w_j ∈ R^d is the weight of the j-th class in the softmax layer and h is the input to the softmax layer. A higher g means a lower probability density over the ID classes and thus implies a higher OOD likelihood.
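Since w_j^T h is the j-th class logit, the energy score reduces to a negative log-sum-exp over the logits; a stable sketch (function name illustrative):

```python
import math

def energy_ood_score(logits):
    """Energy-based OOD score: negative log-sum-exp of class logits.

    Inputs with low density under the ID classes produce small logits
    and hence a higher (less negative) score.
    """
    m = max(logits)  # factor out the max for numerical stability
    return -(m + math.log(sum(math.exp(l - m) for l in logits)))
```

A confidently classified ID input (one large logit) gets a very negative score, while uniformly small logits push the score up.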

Mahalanobis Distance (Maha).
Lee et al. (2018b) model the ID features with class-conditional multivariate Gaussian distributions. The method first fits the Gaussian distributions on the ID validation set D_val = {(x_i, y_i)}_{i=1}^{M} using the input representation h in the penultimate layer of the model:

$$\mu_j = \frac{1}{M_j} \sum_{i:\, y_i = j} h_i, \qquad \Sigma = \frac{1}{M} \sum_{i=1}^{M} (h_i - \mu_{y_i})(h_i - \mu_{y_i})^{\top},$$

where C is the number of classes, M_j is the number of validation instances of class j, μ_j is the mean vector of class j, and Σ is a covariance matrix shared by all classes. Then, given an instance x during inference, the method calculates the OOD detection score as the minimum Mahalanobis distance to the C class means:

$$g(x) = \min_{j=1}^{C}\ (h - \mu_j)^{\top} \Sigma^{+} (h - \mu_j),$$

where Σ⁺ is the pseudo-inverse of Σ. The Mahalanobis distance reflects the probability density of h under the fitted Gaussians: a larger distance corresponds to a lower density and thus a higher OOD likelihood.
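The fit-then-score procedure can be sketched in pure Python. For simplicity this sketch inverts Σ directly (assuming it is invertible) instead of taking the pseudo-inverse Σ⁺, and all helper names are illustrative:

```python
def mat_inv(a):
    """Invert a small square matrix via Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_maha(features, labels, num_classes):
    """Fit per-class means and a shared covariance on ID validation data."""
    d = len(features[0])
    mus = []
    for c in range(num_classes):
        cls = [f for f, l in zip(features, labels) if l == c]
        mus.append([sum(col) / len(cls) for col in zip(*cls)])
    sigma = [[0.0] * d for _ in range(d)]
    for f, l in zip(features, labels):
        diff = [x - m for x, m in zip(f, mus[l])]
        for i in range(d):
            for j in range(d):
                sigma[i][j] += diff[i] * diff[j] / len(features)
    return mus, mat_inv(sigma)  # Sigma assumed invertible in this sketch

def maha_ood_score(h, mus, sigma_inv):
    """Minimum Mahalanobis distance to any class mean (higher = more OOD)."""
    best = float("inf")
    for mu in mus:
        diff = [x - m for x, m in zip(h, mu)]
        tmp = [sum(s * v for s, v in zip(row, diff)) for row in sigma_inv]
        best = min(best, sum(v * t for v, t in zip(diff, tmp)))
    return best
```

In practice Σ⁺ would be computed with a library pseudo-inverse on high-dimensional [CLS] features; points near a class mean score close to zero, points far from every mean score high.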
Cosine Similarity can also be incorporated to consider the angular similarity of input representations.
To do so, the scoring function returns the OOD score as the negative of the maximum cosine similarity of h to instances of the ID validation set:

$$g(x) = -\max_{(x_i, y_i) \in D_{val}} \cos(h, h_i),$$

so that higher scores still indicate more likely OOD instances. The above OOD scoring functions, combined with the options of contrastive losses, lead to different variants of our framework. We evaluate each combination in experiments.
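A sketch of the cosine-similarity scoring function; here the maximum similarity is negated so that, consistent with the convention that higher scores mean more likely OOD, low similarity to every ID validation representation yields a high score (function names illustrative):

```python
import math

def max_cosine_ood_score(h, id_reps):
    """OOD score from cosine similarity to ID validation representations.

    Returns the negated maximum cosine similarity, so an input far (in
    angle) from every ID representation receives a high OOD score.
    """
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    return -max(cos(h, r) for r in id_reps)
```

An input nearly parallel to some ID representation scores close to -1; one pointing away from all of them scores higher.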

Experiments
This section presents experimental evaluations of the proposed OOD detection framework. We start by describing experimental datasets and settings (Sec.4.1 and 4.2), followed by detailed results analysis and case studies (Sec.4.3 to 4.5).

Datasets
Previous studies on OOD detection mostly focus on image classification, while few efforts have been made on natural language. Currently, there is still no well-established benchmark for OOD detection in NLP. Therefore, we extend the datasets selected by Hendrycks et al. (2020) and propose a more extensive benchmark, where we use different pairs of NLP datasets as ID and OOD data, respectively. The criterion for dataset selection is that the OOD instances should not belong to the ID classes. To ensure this, we refer to the label descriptions in the datasets and manually inspect samples of instances.
We use the following datasets as ID data, corresponding to three natural language classification tasks: • Sentiment Analysis. We include two datasets for this task. SST2 (Socher et al., 2013) and IMDB (Maas et al., 2011) are both datasets for sentiment analysis, where the polarities of sentences are labeled either positive or negative. For SST2, the train/validation/test splits are provided in the dataset. For IMDB, we randomly sample 10% of the training instances as the validation set. Note that both datasets belong to the same task and are not considered OOD to each other.
• Topic Classification. We use 20 Newsgroup (Lang, 1995), a dataset for topic classification containing 20 classes. We randomly divide the whole dataset into an 80/10/10 split as the train/validation/test sets. • Question Classification. We use TREC-10, a dataset for question classification in which questions are labeled with the types of their sought-after answers. We use its coarse version with 6 classes and randomly sample 10% of the training instances as the validation set.
Moreover, for the above three tasks, any pair of datasets from different tasks can be regarded as OOD to each other. Besides, following Hendrycks et al. (2020), we also select four additional datasets solely as OOD data: concatenations of the premises and respective hypotheses from two NLI datasets, RTE (Dagan et al., 2005; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and MNLI (Williams et al., 2018), and the English source side of the English-German Machine Translation (MT) datasets WMT16 (Bojar et al., 2016) and Multi30K (Elliott et al., 2016). We take the test splits of those datasets as OOD instances in testing. In particular, for MNLI, we use both the matched and mismatched test sets. For Multi30K, we use the union of the flickr 2016, mscoco 2017, and flickr 2018 English test sets. There are several reasons for not using them as ID data: (1) WMT16 and Multi30K are MT datasets and do not apply to a natural language classification problem. Therefore, we cannot train a classifier on these two datasets.
(2) The instances in the NLI datasets are labeled as entailment/non-entailment for RTE or entailment/neutral/contradiction for MNLI, which comprehensively covers all possible relationships between two sentences. Therefore, it is hard to determine OOD instances for the NLI datasets. The statistics of the datasets are shown in Tab. 1.

Experimental Settings
Evaluation Protocol. We train the model on the training split of each of the four aforementioned ID datasets in turn. In the inference phase, the respective test split of that dataset is used as ID test data, while all the test splits of datasets from other  tasks are treated as OOD test data. We adopt two metrics that are commonly used for measuring OOD detection performance in machine learning research (Hendrycks and Gimpel, 2017;Lee et al., 2018b): (1) AUROC is the area under the receiver operating characteristic curve, which plots the true positive rate (TPR) against the false positive rate (FPR). A higher AUROC value indicates better OOD detection performance, and a random guessing detector corresponds to an AUROC of 50%.
(2) FAR95 is the probability for a negative example (OOD) to be mistakenly classified as positive (ID) when the TPR is 95%, in which case a lower value indicates better performance. Both metrics are threshold-independent.
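Both metrics are straightforward to compute from score lists. The sketch below assumes the convention used throughout this paper (higher score = more likely OOD, so OOD is the positive class for detection), and approximates the 95%-TPR threshold with a simple percentile index:

```python
def auroc(id_scores, ood_scores):
    """AUROC: probability that a random OOD score exceeds a random ID
    score (ties count half), with higher scores indicating OOD."""
    wins = sum((o > i) + 0.5 * (o == i)
               for o in ood_scores for i in id_scores)
    return wins / (len(ood_scores) * len(id_scores))

def far95(id_scores, ood_scores):
    """Fraction of OOD scores at or below the threshold that accepts
    ~95% of ID instances (index-based percentile approximation)."""
    thresh = sorted(id_scores)[int(0.95 * len(id_scores))]
    return sum(o <= thresh for o in ood_scores) / len(ood_scores)
```

Perfectly separated scores give AUROC = 1.0 and FAR95 = 0.0; identical score distributions give AUROC ≈ 0.5.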
Compared Methods. We evaluate all configurations of contrastive losses and OOD scoring functions. Those include 12 settings composed of 3 alternative setups for contrastive losses (L scl , L margin or w/o a contrastive loss) and 4 alternatives of OOD scoring functions (MSP, the energy score, Maha, or cosine similarity).
Model Configuration. We implement our framework upon Huggingface's Transformers (Wolf et al., 2020) and build the text classifier based on RoBERTa LARGE (Liu et al., 2019) in the main experiment. All models are optimized with Adam (Kingma and Ba, 2015) using a learning rate of 1e−5 with a linear learning rate decay towards 0. We use a batch size of 32 and fine-tune the model for 10 epochs. When training the model on each training split of a dataset, we use the respective validation split for both hyper-parameter tuning and model selection. The hyper-parameters are tuned according to the classification performance and the contrastive loss on the ID validation set. We find that τ = 0.3 and λ = 2 work well with L scl, while λ = 2 works well with L margin, and we apply these values to all datasets.

Main Results
We hereby discuss the main results of the OOD detection performance. Note that the incorporation of our OOD techniques does not lead to noticeable interference of the main-task performance, for which an analysis is later given in Sec. 4.5.
The OOD detection results for different model configurations are given in Tab. 2. For all results, we report the average of 5 runs using different random seeds. Each model configuration is reported with separate sets of results when trained on different datasets, on top of which the macro-average performance is also reported. For settings with L scl and L margin, results better than the baselines (w/o a contrastive loss) are marked in red. We observe that: (1) Among the OOD scoring functions, the Mahalanobis distance performs the best on average and drastically outperforms the MSP baseline used in Hendrycks et al. (2020). This is because the Mahalanobis distance can better capture the distributional difference. (2) Considering models trained on different ID datasets, the model variants with L margin achieve near-perfect OOD detection performance on SST2, IMDB, and TREC-10, while on the 20 Newsgroup dataset, which contains articles from multiple genres, there is still room for improvement.
(3) Overall, the margin-based contrastive loss (L margin) significantly improves OOD detection performance. In particular, it performs best with the Mahalanobis distance, reducing the average FAR95 of Maha by 77%, from 7.3% to 1.7%. (4) The supervised contrastive loss (L scl) does not effectively improve OOD detection in general; in many cases, its performance is even worse than the baseline.

Figure 2: Visualization of the representations of positive and negative instances in SST-2, together with OOD ones, obtained w/o L cont, w/ L scl, and w/ L margin. The discrepancy between ID and OOD representations is greatest for representations obtained with L margin.
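To make the scoring function concrete, the following is a minimal NumPy sketch of Mahalanobis-distance OOD scoring over penultimate-layer features, in the style of Lee et al. (2018): fit per-class means and a single shared covariance on ID data, then score a test instance by its minimum squared Mahalanobis distance to any class mean. Function names and the exact estimator details are ours, not from the paper.

```python
import numpy as np

def fit_gaussian(features, labels):
    """Fit per-class means and one shared (tied) covariance on ID features.

    Returns the class means and the precision matrix (pseudo-inverse of
    the shared covariance), which the scoring function needs.
    """
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)

def maha_score(x, means, precision):
    """OOD score of a single feature vector: the minimum squared
    Mahalanobis distance to any class mean (higher = more likely OOD)."""
    return min(float((x - m) @ precision @ (x - m)) for m in means.values())
```

In practice, `features` would be the model's penultimate-layer representations of the ID training set; at inference, instances whose score exceeds a threshold (chosen, e.g., for a target FAR95) are flagged as OOD.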

Novel Class Detection
We further evaluate our framework in a more challenging novel class detection setting. Given a dataset containing multiple classes (≥ 3), we randomly reserve one class as OOD data while treating the others as ID data. We then train the model on the ID data and require it to identify the OOD data at inference. In this case, the OOD data are sampled from the same task corpus as the ID data and are thus much harder to distinguish. We report the average performance over 5 trials in Tab. 3. The results are generally consistent with the main results. The Mahalanobis distance consistently outperforms MSP, and L margin achieves better performance except for the FAR95 metric on the TREC-10 dataset. However, the performance gain is notably smaller than in the main experiments. Moreover, none of the compared methods achieves an AUROC score over 80%. This experiment shows that, compared to detecting OOD instances from other tasks, detecting OOD instances from similar corpora is much more challenging and leaves room for further investigation.

Effect of Distance Metrics. We also compare different distance metrics used with the margin-based contrastive loss L margin. Results are shown in Tab. 5. Due to space limitations, we only report the average OOD performance on the four ID datasets. We observe that the three metrics achieve similar performance, and all outperform the baseline when using Maha as the scoring function. Among them, the L2 distance gives slightly better OOD detection performance. Moreover, L margin significantly outperforms L scl when both use cosine as the distance metric, showing that their performance difference arises from the characteristics of the losses rather than from the metric.
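As an illustration of how the distance metric can be swapped inside the same nearest-class-mean scoring scheme, here is a small NumPy sketch comparing cosine and L2 distances. This is our own simplification for exposition, not the exact setup behind Tab. 5.

```python
import numpy as np

def cosine_dist(x, mu, eps=1e-12):
    """Cosine distance (1 - cosine similarity) between a feature and a class mean."""
    return 1.0 - float(x @ mu) / (np.linalg.norm(x) * np.linalg.norm(mu) + eps)

def l2_dist(x, mu):
    """Euclidean (L2) distance between a feature and a class mean."""
    return float(np.linalg.norm(x - mu))

def ood_score(x, means, dist):
    """Score = distance to the nearest class mean; larger means more likely OOD."""
    return min(dist(x, mu) for mu in means)
```

Any of these metrics slots into the same pipeline; only the `dist` argument changes, which is why differences in results can be attributed to the training loss rather than the scoring metric.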
OOD Detection by Other Transformers. We also evaluate the OOD detection ability of other pretrained Transformers in Tab. 6 and report the average performance on the four ID datasets. For BERT (Devlin et al., 2019), we use λ = 0.2. We observe that: (1) Larger models have better OOD detection ability. For both BERT and RoBERTa, the large versions offer better results than the base versions.
(2) Pretraining on diverse data improves OOD detection. RoBERTa, which is pretrained on more diverse corpora, outperforms the BERT models. (3) The margin-based contrastive loss consistently improves OOD detection across all encoders.
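For intuition about how a margin-based objective compacts ID representations, the following is a generic hinge-style margin contrastive loss in NumPy: same-class pairs are pulled together, while different-class pairs are pushed at least `margin` apart. This is a standard textbook form for illustration only; the exact objective used in our experiments may differ.

```python
import numpy as np

def margin_contrastive_loss(features, labels, margin=1.0):
    """Generic pairwise margin loss (illustrative, not the paper's exact L margin).

    Same-class pairs contribute their squared distance (compactness);
    different-class pairs contribute a squared hinge max(0, margin - d)^2
    (separation). Returns the mean over all pairs.
    """
    n = len(features)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(features[i] - features[j])
            if labels[i] == labels[j]:
                loss += d ** 2           # pull ID instances of a class together
            else:
                loss += max(0.0, margin - d) ** 2  # push classes apart
            pairs += 1
    return loss / max(pairs, 1)
```

Under this objective, well-separated, compact class clusters yield a lower loss than overlapping ones, which is exactly the geometry that makes distance-based OOD scores such as Mahalanobis effective.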

Conclusion
This work presents an unsupervised OOD detection framework for pretrained Transformers that requires only ID data. We systematically investigate the combination of contrastive losses and scoring functions, the two key components of our framework. In particular, we propose a margin-based contrastive objective for learning compact representations, which, in combination with the Mahalanobis distance, achieves the best performance: near-perfect OOD detection on various tasks and datasets. We further propose novel class detection as a future challenge for OOD detection.

Ethical Consideration
This work does not present any direct societal consequences. It seeks to develop a general contrastive learning framework for unsupervised OOD detection in natural language classification. We believe this study yields intellectual merits that benefit the reliable application of NLU models, since in real-world scenarios a model may face heterogeneous inputs with significant semantic shifts from its training distribution. The work also has potentially broad impacts, since the issues it tackles widely exist in tasks from other areas.

A Full Results
We report the full OOD detection performance for each ID dataset against each OOD dataset. The results w/o L cont and w/ L margin are shown in Tab. 7 and Tab. 8, respectively.