Unsupervised Out-of-Domain Detection via Pre-trained Transformers

Deployed real-world machine learning applications are often subject to uncontrolled and even potentially malicious inputs. Those out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues. Prior studies on out-of-domain detection require in-domain task labels and are limited to supervised classification scenarios. Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data. We utilize the latent representations of pre-trained transformers and propose a simple yet effective method to transform features across all layers to construct out-of-domain detectors efficiently. Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy. Our empirical evaluations of related methods on two datasets validate that our method greatly improves out-of-domain detection ability in a more general scenario.


Introduction
Deep neural networks, despite achieving good performance on many challenging tasks, can make overconfident predictions for completely irrelevant and out-of-domain (OOD) inputs, leading to significant AI safety issues (Hendrycks and Gimpel, 2017). Detecting out-of-domain inputs is a fundamental task for trustworthy AI applications in real-world use cases, because those applications are often subject to ill-defined queries or even potentially malicious inputs. Prior work on out-of-domain detection (e.g., Hendrycks and Gimpel, 2017; Lee et al., 2018; Liang et al., 2018; Hendrycks et al., 2019, 2020; Xu et al., 2020) mostly requires in-domain task labels, limiting its usage to supervised classification. However, deployed applications rarely receive controlled inputs and are susceptible to an ever-evolving set of user inputs that are scarcely labeled. For example, for many non-classification tasks, such as summarization or topic modeling, there are no available classifiers or task labels, which limits the practical usage of recently proposed out-of-domain detection methods. Therefore, it is natural to ask the following question: Can we detect out-of-domain samples using only unsupervised data without any in-domain labels?
We regard the out-of-domain detection problem as checking whether given test samples are drawn from the same distribution that generates the in-domain samples, which requires a weaker assumption than prior work (e.g., Lee et al., 2018; Hendrycks et al., 2020). We suppose that only in-domain samples are available, which allows us to understand the properties of the data itself regardless of tasks. Therefore, methods developed for this problem are more widely applicable than task-specific ones and can be further adapted to tasks where no classification labels are present, such as active learning or transfer learning.
To solve the problem, we utilize the latent embeddings of pre-trained transformers (e.g., Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019) to represent the input data, which allow us to apply classical OOD detection methods such as one-class support vector machines (Schölkopf et al., 2001) or support vector data description (Tax and Duin, 2004) on them.
However, the best practice for extracting features from BERT is usually task-specific. For supervised classification, we can represent a text sequence using the hidden state of the [CLS] token from the top layer. Meanwhile, BERT's intermediate layers also capture rich linguistic information that may outperform the top layer for specific NLP tasks. By performing probing tasks on each layer, Jawahar et al. (2019) suggest that bottom layers of BERT capture more surface features, middle layers focus more on syntax, and semantic features are well represented by top layers.
As no prior knowledge about OOD samples is usually available in practice, deciding which layer's features are the most effective for OOD detection is itself non-trivial. Some OOD samples may contain just a few out-of-vocabulary words, while others are OOD due to their syntax or semantics.
Based on the observations above, this paper studies how to leverage all-layer features from a pre-trained transformer for OOD detection in an unsupervised manner. Our contributions are three-fold:
• By analyzing all layers of (Ro)BERT(a) models, we empirically validate that it is hard to extract features from a single layer that work well for all OOD datasets.
• We propose a computationally efficient way to transform all-layer features of a pre-trained transformer into a low-dimensional representation. We empirically validate that the proposed method outperforms baselines that use one-layer features or simple aggregations of all layers.
• We propose two different techniques for fine-tuning a pre-trained transformer to further improve its capability of detecting OOD data.

Problem Setup
Assume that we have a collection of text inputs $D_n := \{x_i\}_{i=1}^{n}$. We want to construct an out-of-domain detector that takes an unseen new input u and determines whether u comes from the same distribution that generates $D_n$. We adopt a more practical setting in which we have no prior knowledge of what out-of-domain inputs look like; in this case, training a domain classifier directly is not feasible. The out-of-domain detector can be described mathematically as

$$g(u) = \begin{cases} \text{in-domain}, & I(u) \le \varepsilon, \\ \text{out-of-domain}, & I(u) > \varepsilon, \end{cases}$$

where $I(\cdot)$ denotes the anomaly score function, and $\varepsilon$ is a threshold chosen to ensure that the true positive rate is at a certain level (e.g., 95%) (Hendrycks and Gimpel, 2017; Liang et al., 2018; Lee et al., 2018). The OOD detection problem boils down to designing $I(\cdot)$ such that it assigns in-domain inputs lower scores than out-of-domain inputs.
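As an illustrative sketch (Python/NumPy; the helper names and toy score values are our own, not from the paper), the thresholding rule and the TPR-based choice of the threshold look like:

```python
import numpy as np

def pick_threshold(in_domain_scores: np.ndarray, tpr: float = 0.95) -> float:
    """Choose the threshold so that `tpr` of in-domain samples fall below it,
    i.e., the true positive rate on in-domain data is at the target level."""
    return float(np.quantile(in_domain_scores, tpr))

def is_out_of_domain(scores: np.ndarray, eps: float) -> np.ndarray:
    """The detector g(u): flag inputs whose anomaly score I(u) exceeds eps."""
    return scores > eps

# Toy anomaly scores for five in-domain samples (illustrative values).
scores_in = np.array([0.1, 0.2, 0.15, 0.3, 0.25])
eps = pick_threshold(scores_in, tpr=0.8)
flags = is_out_of_domain(np.array([0.2, 5.0]), eps)  # low vs. high score
```

A deployed detector would compute `scores_in` on a held-out in-domain split rather than the training set itself.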
There are two different scenarios, considering if we have any in-domain labels for data x i ∈ D n . Here we define in-domain labels as any specific supervised task labels, such as sentiments, intents or topics of the text.
With in-domain labels Suppose that we have multi-class labels y. Given a classifier h trained with $D_n$, we can use the maximum calibrated softmax probability with temperature scaling as the anomaly score (Liang et al., 2018; Hinton et al., 2015):

$$I(x) = -\max_{c} \frac{\exp(h_c(x)/T)}{\sum_{c'} \exp(h_{c'}(x)/T)},$$

where $h(x)$ is the output logits of the multi-class classifier, and $T$ is the temperature, selected such that the true positive rate is at a given level (e.g., 95% in Liang et al. (2018)). This method is known as Maximum Softmax Probability (MSP); it requires multi-class labels to train a classifier, which limits its application in practice. We argue that requiring in-domain labels is a less practical scenario for OOD detection and will not discuss it further in this paper.
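A minimal MSP sketch (Python/NumPy; our own illustrative code, with the negated maximum probability used as the anomaly score so that higher means more anomalous):

```python
import numpy as np

def msp_anomaly_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Negative maximum softmax probability with temperature scaling."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -probs.max(axis=-1)

# A confident prediction yields a low anomaly score; a flat one a higher score.
confident = msp_anomaly_score(np.array([[10.0, 0.0, 0.0]]))
uncertain = msp_anomaly_score(np.array([[1.0, 1.0, 1.0]]))
```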
Without in-domain labels The setting without in-domain labels is our major focus. Under this assumption, the models at hand are usually not classifiers but feature extractors. It is then natural to resort to classic outlier detection methods, such as the one-class support vector machine (Schölkopf et al., 2001), support vector data description (Tax and Duin, 2004), or kernel density estimation (KDE), for estimating the support or the density of the in-domain data distribution.
When applying such methods to text data, the major focus of prior work is designing a good network structure or learning objective (Ruff et al., 2018). Instead, in this paper we mainly focus on how to obtain good representations from pre-trained transformers and design new anomaly scores without modifying their structure, while still obtaining good OOD detection performance.

Model and Feature Learning
BERT and its variants such as RoBERTa (e.g., Devlin et al., 2019; Liu et al., 2019) are pre-trained on large-scale public data (denoted D_pub) using self-supervised tasks such as language modeling and next sentence prediction. These models show promising results when transferred to tasks in other domains. We aim to leverage features obtained from pre-trained transformers to construct OOD detectors in lieu of in-domain labels in D_n.

Figure 1: An overview of using Mahalanobis distance features (MDF) extracted from a pre-trained transformer f to detect out-of-domain data. We estimate the mean ĉ_ℓ and covariance matrix Σ̂_ℓ for each layer of f using samples from an unsupervised training set D_n, and then extract the MDF of D_n to optimize an OC-SVM. Given an unseen test sample, its feature M is extracted using ĉ_ℓ and Σ̂_ℓ and then fed into the OC-SVM for an anomaly score. Two domain-specific fine-tuning methods, IMLM and BCAD, can be further applied to BERT to boost detection accuracy.

BERT features for OOD detection
After pre-training, we obtain a BERT/RoBERTa model f with L layers. We denote by f_ℓ(x) ∈ R^d the d-dimensional feature embedding corresponding to the ℓ-th layer for input x, and by f(x) the overall representation using all layers of f. We explore the following methods for extracting BERT features to construct OOD detectors.
Features from the ℓ-th layer f_ℓ Options to extract f_ℓ(x) include using the hidden state of the [CLS] token or averaging all contextualized token embeddings at the ℓ-th layer. We can then directly construct an OOD detector based on the features f_ℓ(x) of each input x in D_n using existing sample-based methods, such as a one-class support vector machine (OC-SVM).

Features from all layers Using BERT features from only one layer might not be sufficient, as prior work (Jawahar et al., 2019) has shown that different layers of BERT capture distinct linguistic properties: lower layers capture lexical properties, middle layers represent syntactic properties, and semantic properties surface in higher layers. The effects of BERT features from different layers on detecting OOD data are yet to be investigated. One straightforward way to leverage all L layers is to concatenate all layer-wise features f_ℓ(x), which incurs no information loss. However, this solution is computationally expensive and makes an OC-SVM or kernel-based methods hard to optimize. Another solution is to perform aggregation, such as max- or mean-pooling, along the feature dimension across all layers, sacrificing some information in exchange for efficiency.
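The two extraction options can be sketched as follows (Python/NumPy). In practice the per-layer hidden states would come from the encoder (e.g., a HuggingFace-style model called with `output_hidden_states=True`); here random arrays stand in for them, and the function name is our own:

```python
import numpy as np

def layer_features(hidden_states, mode="avg"):
    """One sequence representation per layer.
    hidden_states: list of (seq_len, d) arrays, one per transformer layer.
    mode="cls" takes the first ([CLS]) token's hidden state;
    mode="avg" mean-pools all contextualized token embeddings."""
    if mode == "cls":
        return [h[0] for h in hidden_states]
    return [h.mean(axis=0) for h in hidden_states]

# Toy stand-in for a 2-layer encoder over a 3-token input with d = 4.
rng = np.random.default_rng(0)
states = [rng.normal(size=(3, 4)) for _ in range(2)]
avg_feats = layer_features(states, mode="avg")
cls_feats = layer_features(states, mode="cls")
```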
In this paper, we propose a simple yet effective method (described below) to use latent representations from all layers of a pre-trained transformer and can automatically decide features from which layers are important. Besides, this method is computationally efficient, only requiring us to solve a low-dimensional constrained convex optimization.
Mahalanobis distance as features (MDF) for all layers Support Vector Data Description (SVDD) (Tax and Duin, 2004) is a technique related to OC-SVM where a hypersphere is used to separate the data instead of a hyperplane. However, the features provided by deep models may not be separable by hyperspheres. We focus on a generalization of the hypersphere called hyper-ellipsoid to account for such surface shapes.
Suppose that we use the concatenated features from all layers, $\Phi(x) \in \mathbb{R}^{d \cdot L}$, and consider the following optimization problem to find the hyper-ellipsoid, similar to the optimization formulation of SVDD:

$$\min_{c,\,\Sigma,\,R,\,\xi}\; R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i + \frac{1}{2}\|\Sigma\|_{\mathrm{Fr}}^2 \quad \text{s.t.}\quad (\Phi(x_i)-c)^\top \Sigma^{-1} (\Phi(x_i)-c) \le R^2 + \xi_i,\;\; \xi_i \ge 0,\;\; \forall i, \qquad (1)$$

where $\Phi$ is the feature map, $c$ is the center of the hyper-ellipsoid, $\Sigma$ is a symmetric positive definite matrix that reflects the shape of the ellipsoid, and $R$ reflects the volume of the hyper-ellipsoid.³ Here we also introduce a regularization term $\frac{1}{2}\|\Sigma\|_{\mathrm{Fr}}^2$ to constrain the complexity of $\Sigma$. If $\Sigma = I$, the optimization problem is identical to one-class SVDD.
Solving Eq (1) exactly can be difficult, since it involves finding the optimal Σ of shape D×D, where D = d · L is the dimension of the features. For the concatenated features Φ(x), D can be tens of thousands or even hundreds of thousands, which makes the exact solution computationally intractable. To tackle the problem, we consider a simple and computationally efficient approximation of the solution, which can be useful in practice.
First, we decompose the feature space into subspaces corresponding to the features from different layers, i.e., we assume $\Sigma$ is a block-diagonal matrix whose ℓ-th block $\Sigma_\ell$ reflects the shape of the feature distribution at layer ℓ. By a straightforward calculation, the constraint in Eq (1) becomes

$$\sum_{\ell=1}^{L} (f_\ell(x_i) - c_\ell)^\top \Sigma_\ell^{-1} (f_\ell(x_i) - c_\ell) \le R^2 + \xi_i,$$

where we decompose the center as $c = [c_1, \ldots, c_L]$. Still, optimizing $c_\ell$ and $\Sigma_\ell$ can be difficult since the dimension of $f_\ell(x)$ can be high. Based on the intuition that $c_\ell$ and $\Sigma_\ell$ should not deviate far from the empirical mean $\hat{c}_\ell$ and covariance $\hat{\Sigma}_\ell$ estimated from the training data, we replace them with the approximation

$$c_\ell = \hat{c}_\ell, \qquad \Sigma_\ell = \hat{\Sigma}_\ell / w_\ell,$$

where $w_\ell$ is a layer-dependent constant. Now we only need to find proper $\{w_\ell\}_{\ell=1}^{L}$ as well as the corresponding $R$ and $\{\xi_i\}_{i=1}^{n}$, a low-dimensional optimization problem that scales only linearly with the number of layers L. We further define

$$M_\ell(x) := (f_\ell(x) - \hat{c}_\ell)^\top \hat{\Sigma}_\ell^{-1} (f_\ell(x) - \hat{c}_\ell), \qquad M(x) := [M_1(x), \ldots, M_L(x)]^\top,$$

Footnote 3: We can further assume $\|\Sigma\| = 1$, where the norm can be the operator norm or the Frobenius norm, which gives a definition of the hyper-ellipsoid with unique $\Sigma$ and $R$.
where the square root of $M_\ell(x_i)$ is also referred to as the Mahalanobis distance of the layer-ℓ features of data $x_i$. Instead of minimizing $\frac{1}{2}\|\Sigma\|_{\mathrm{Fr}}^2$, we minimize $-\frac{1}{2}\|w\|_2^2$, which has a similar regularization effect on $\Sigma$ (since we do not want $\|w\|_2$ to be small, which would make $\|\Sigma\|_{\mathrm{Fr}}$ very large). So the final optimization problem to solve is

$$\min_{w,\,R,\,\xi}\; R^2 + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \frac{1}{2}\|w\|_2^2 \quad \text{s.t.}\quad \langle w, M(x_i)\rangle \le R^2 + \xi_i,\;\; \xi_i \ge 0,\;\; \forall i, \qquad (2)$$

which is in fact a one-class SVM with a linear kernel, with the Mahalanobis distance of each layer as features (MDF), and it can be solved with standard convex optimization. We illustrate our proposed algorithm in Figure 1.
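A runnable sketch of the full MDF pipeline (Python with NumPy and scikit-learn; our own simplification, with a small ridge term added to keep each covariance invertible). Because the MDF features are nonnegative and a linear OC-SVM's weight vector is a nonnegative combination of training features, a larger Mahalanobis distance yields a larger decision value, which this sketch uses directly as the anomaly score; treat that sign convention as an illustrative assumption:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def fit_mdf_stats(layer_feats):
    """Empirical mean and inverse covariance per layer.
    layer_feats: list of (n, d) arrays, one per layer."""
    stats = []
    for F in layer_feats:
        c = F.mean(axis=0)
        cov = np.cov(F, rowvar=False) + 1e-6 * np.eye(F.shape[1])  # ridge for stability
        stats.append((c, np.linalg.inv(cov)))
    return stats

def mdf(layer_feats, stats):
    """Squared Mahalanobis distance per layer -> (n, L) feature matrix."""
    cols = []
    for F, (c, prec) in zip(layer_feats, stats):
        diff = F - c
        cols.append(np.einsum("nd,de,ne->n", diff, prec, diff))
    return np.stack(cols, axis=1)

# Toy setup: n = 50 samples, L = 3 layers, d = 8 dimensions per layer.
rng = np.random.default_rng(1)
train = [rng.normal(size=(50, 8)) for _ in range(3)]
stats = fit_mdf_stats(train)
M_train = mdf(train, stats)                 # shape (50, 3): low-dimensional
ocsvm = OneClassSVM(kernel="linear", nu=0.1).fit(M_train)
scores = ocsvm.decision_function(M_train)   # larger here -> larger weighted distance

# A point far from the training distribution gets a much larger value.
far = [np.full((1, 8), 10.0) for _ in range(3)]
far_score = ocsvm.decision_function(mdf(far, stats))
```

Note how the OC-SVM operates on a 3-dimensional space regardless of the 8-dimensional per-layer features, mirroring the dimensionality reduction described above.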
Remark Note that the optimization in Eq (2) is not identical to that in Eq (1), since we use the empirical sample means $\{\hat{c}_\ell\}_{\ell=1}^{L}$ and covariances $\{\hat{\Sigma}_\ell / w_\ell\}_{\ell=1}^{L}$ to replace the original parameters $c$ and $\Sigma$ in Eq (1), which are hard to optimize when the dimension of the concatenated features $\Phi(x)$ is high. Also, our approximation from Eq (1) to Eq (2) differs from the known result that when $\Phi(x)$ is the infinite-dimensional feature map of the widely used Gaussian RBF kernels, OC-SVM and SVDD are equivalent and asymptotically consistent density estimators (Tsybakov et al., 1997; Vert et al., 2006). In our case, $\Phi(x)$ is the concatenated features from all layers of a pre-trained transformer, which makes our approximation fundamentally different from prior work.

Feature fine-tuning
We can also fine-tune the pre-trained transformer f on the unsupervised in-domain dataset D n so that f (x x x) can better represent the distribution of D n . We explore two domain-specific fine-tuning approaches.
In-domain masked language modeling (IMLM) Gururangan et al. (2020) find that domain-adaptive masked language modeling (Devlin et al., 2019) improves the supervised classification capability of BERT when it is transferred to that domain. Similarly, we can perform MLM on D_n, and we argue that this makes the features of D_n more concentrated, benefiting downstream OOD detection.
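The masking step at the heart of IMLM can be sketched as follows (Python/NumPy; the mask token id and the -100 label convention are illustrative assumptions modeled on common MLM implementations, not the paper's code):

```python
import numpy as np

MASK_ID = 103  # BERT's [MASK] token id; an assumption for illustration

def mlm_mask(token_ids, rng, mask_prob=0.15):
    """Randomly replace a fraction of tokens with [MASK].
    Returns the corrupted ids and labels, where -100 marks positions
    that the MLM loss should ignore (a common convention)."""
    ids = np.array(token_ids)
    labels = np.full_like(ids, -100)
    mask = rng.random(ids.shape) < mask_prob
    labels[mask] = ids[mask]   # predict the original token at masked positions
    ids[mask] = MASK_ID
    return ids, labels

rng = np.random.default_rng(0)
ids, labels = mlm_mask(list(range(1000, 1100)), rng, mask_prob=0.15)
```

Fine-tuning then minimizes the cross-entropy of the model's predictions at the masked positions against `labels` over sentences from D_n.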
Binary classification with auxiliary dataset (BCAD) Another way of fine-tuning the model f is to use the public dataset D_pub on which it was pre-trained. We treat the training data in D_n as in-domain positive samples and data in the public dataset D_pub as OOD negative samples. We add a new classification layer on top of f and update this layer together with all parameters of f by performing a binary classification task. In practice, we only need a small subset of D_pub, denoted as D̃_pub, for fine-tuning. Since D̃_pub is publicly available and has no labels, we do not violate the unsupervised setting; D̃_pub does not provide any information about the OOD samples at test time either. Besides, the added classification layer can itself be applied to OOD detection using the MSP method; this is exactly the setting of zero-shot classification, which we use as a baseline for comparison in our experiments.
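A minimal BCAD sketch (Python/NumPy; our own simplification: only a logistic-regression head over fixed features is trained, whereas the paper updates all parameters of f jointly, and the feature values here are synthetic):

```python
import numpy as np

def train_bcad_head(feats_in, feats_pub, lr=0.1, steps=500, seed=0):
    """Binary head: D_n samples as positives (1), D~_pub samples as
    negatives (0), trained with plain gradient descent on logistic loss."""
    X = np.vstack([feats_in, feats_pub])
    y = np.concatenate([np.ones(len(feats_in)), np.zeros(len(feats_pub))])
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)   # clip for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))          # sigmoid
        g = p - y                             # gradient of the logistic loss
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic, well-separated features standing in for encoder outputs.
rng = np.random.default_rng(1)
feats_in = rng.normal(loc=2.0, size=(100, 5))
feats_pub = rng.normal(loc=-2.0, size=(100, 5))
w, b = train_bcad_head(feats_in, feats_pub)
p_in = 1.0 / (1.0 + np.exp(-(feats_in @ w + b)))  # in-domain probability
```

The learned in-domain probability can then serve as a zero-shot MSP-style score, matching the baseline described above.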

Experiments
Datasets We consider two distinct datasets for experiments, where one is to regard text from unseen corpora as OOD, and the other one is to detect class-level OOD samples within the same corpus.
• Cross-corpus dataset (SST) We follow the experimental setting in Hendrycks et al. (2020), using the original training set of the SST dataset (Socher et al., 2013) as the in-domain D_n and treating samples from four other datasets (i.e., 20 Newsgroups (Lang, 1995), English-German Multi30K (Elliott et al., 2016), RTE (Dagan et al., 2005), and SNLI (Bowman et al., 2015)) as OOD data. For evaluation, we use the original test data of SST as in-domain positives and randomly pick 500 samples from each of the four datasets as OOD negatives. We do not include any sentiment labels from SST in D_n for training.
• Cross-intent dataset (CLINIC150) This is a crowdsourced dialog dataset (Larson et al., 2019), including in-domain queries covering 150 intents and out-of-domain queries that do not fall within any of the 150 intents. We use all 15,000 queries that are originally in its training data as in-domain samples but discard their intent labels. For evaluation, we mix the 4,500 unseen in-domain test queries with 1,000 out-of-domain queries and wish to separate two sets by their anomaly scores.
Examples taken from the two datasets can be found in Table 1. Note that for both datasets, only the in-domain samples are used for training, and the source/intent labels are not used in our experiments.

Evaluation metrics
We rank all test samples by their anomaly scores and follow Liang et al. (2018) to report four metrics: Area Under the Receiver Operating Characteristic curve (AUROC), Detection Accuracy (DTACC), and Area Under the Precision-Recall curve (AUPR) computed with in-domain and with out-of-domain test sentences as positives, denoted AUIN and AUOUT respectively.
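Three of the four metrics can be computed directly with scikit-learn (DTACC, the best balanced detection accuracy over thresholds, is omitted from this sketch); the convention that higher anomaly scores indicate OOD is assumed:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def ood_metrics(scores_in, scores_out):
    """AUROC plus AUPR with OOD (AUOUT) or in-domain (AUIN) as positives."""
    y = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    s = np.concatenate([scores_in, scores_out])
    return {
        "AUROC": roc_auc_score(y, s),
        "AUOUT": average_precision_score(y, s),      # OOD sentences as positives
        "AUIN": average_precision_score(1 - y, -s),  # in-domain as positives
    }

# Perfectly separated toy scores give 1.0 on all three metrics.
m = ood_metrics(np.array([0.1, 0.2, 0.3]), np.array([0.8, 0.9]))
```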
Model configurations We evaluate all methods with both BERT and RoBERTa (base models with 768 latent dimensions and 12 layers).
Choice ofD pub for BCAD We adopt the BooksCorpus (Zhu et al., 2015) and English Wikipedia, which are the sources used in common by BERT and RoBERTa for pre-training. We split paragraphs into sentences and sampleD pub to have the same size as D n for BCAD.
Baselines To examine the effectiveness of our newly proposed anomaly score based on MDF that utilizes the representations of all layers, we compare it with the following baselines.
• (Ro)BERT(a)-Single layer: it uses the single-layer features f_ℓ(x) mentioned above. We iterate over all 12 layers; detailed results for each layer are discussed in Section 5.1.
• (Ro)BERT(a)-Mean pooling: we construct an all-layer representation by averaging all f_ℓ(x), which has 768 dimensions.
• (Ro)BERT(a)-Max pooling: we aggregate all layers by picking the largest value along each feature dimension, yielding a 768-dimensional vector.
• (Ro)BERT(a)-Euclidean distance as features (EDF): we replace Mahalanobis distance with Euclidean distance and still obtain a 12-dimensional vector.
• TF-IDF: we extract TF-IDF features and adopt SVD to reduce high-dimensional features to 100 dimensions for computational efficiency.
All of the above methods extract features as the input to OC-SVM to compute anomaly scores.
• BCAD + MSP: it performs zero-shot classification after BCAD fine-tuning, as discussed in Section 3. The temperature scaling is tuned to achieve the best result. This method is not applicable when no D̃_pub is provided.
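The TF-IDF baseline above can be sketched in a few lines of scikit-learn (a toy corpus and 2 SVD components instead of the paper's 100, purely for illustration):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

train_texts = ["the movie was great", "a moving love story",
               "terrible acting and plot", "an enjoyable film overall"]
test_texts = ["a wonderful movie", "gradient descent converges quadratically"]

# TF-IDF features -> truncated SVD to a low dimension -> OC-SVM anomaly scores.
featurizer = make_pipeline(TfidfVectorizer(),
                           TruncatedSVD(n_components=2, random_state=0))
X_train = featurizer.fit_transform(train_texts)
ocsvm = OneClassSVM(nu=0.5).fit(X_train)
scores = -ocsvm.decision_function(featurizer.transform(test_texts))
```

Unseen vocabulary simply maps to zero TF-IDF weight, which is one reason this baseline struggles against contextual features.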

Results and Discussions
In this section, we present the results of our experiments and summarize our findings.

Table 2 reports OOD detection performance using the [CLS] token and averaged token embeddings (AVG) at each layer of (Ro)BERT(a) models on the cross-corpus and the cross-intent datasets. We observe that detecting cross-intent OOD samples in CLINIC150 is more challenging than detecting cross-dataset OOD data in SST. This is mainly because the OOD samples in CLINIC150 are sorted by humans and the differences between intents can be subtle. We further compare the performance of these two settings in Figure 2.

Using single-layer features f_ℓ(x)
The best f_ℓ(x) for OOD detection is dataset-specific For the cross-corpus dataset (SST), we find that the best results come from the top layer of both (Ro)BERT(a). However, for the cross-intent dataset (CLINIC150), the middle layers perform best when using [CLS], while the bottom layers achieve the best results with AVG. This indicates that OOD distributions are not simply based on certain types of linguistic features and that the strategy for choosing f_ℓ(x) is dataset-specific: for some datasets, semantic features play a more important role, while for others syntactic or lexical features matter more. This validates the assumption that it is beneficial to fully utilize all layers of hidden representations from pre-trained transformers to detect OOD instances.
We find that using f_ℓ(x) of BERT is generally better than RoBERTa, especially with [CLS]. We conjecture that next sentence prediction, which pre-trains on [CLS] and is exclusive to BERT, may be the cause. In later sections, (Ro)BERT(a)-Single layer will refer to the best one in Table 2.

Table 3: OOD detection performance on SST and CLINIC150 for all models. OC-SVM is used for computing anomaly scores except for MSP, and its parameter size is #feats. For (Ro)BERT(a)+Single-layer, the best results in Table 2 are reported. For all MDF-based models, we only report results with AVG as the sequence representation at each layer due to space limits. Larger values of all four metrics indicate better performance. The best result for each metric is marked in bold.

Overall OOD detection performance
We report the empirical results of OOD detection in Table 3 and make the following observations.
Pre-trained transformers produce good feature representations Methods using single-layer features f_ℓ outperform frequency-based features (TF-IDF) and zero-shot classification (MSP), which validates the strong representation capability granted by self-supervised pre-training.
Simple aggregations of all layers are not so effective The results of max-pooling and mean-pooling are not very promising. Even though we observe an absolute 0.5% boost on SST using max-pooling, the best single layer actually outperforms those simple aggregations on CLINIC150.
MDF is more effective MDF consistently outperforms methods that directly use single-layer features f_ℓ(x), simple aggregations of f_ℓ(x), or TF-IDF features on all four metrics. In terms of AUROC, MDF outperforms the best single layer of (Ro)BERT(a) by an absolute 7.1% on SST and 14.0% on CLINIC150.
MDF also performs better than EDF. Note that Euclidean distance is a special case of Mahalanobis distance in which the covariance is an identity matrix. Empirically, the features generated by neural models do not have uniform variance across dimensions, and the comparison between MDF and EDF validates that SVDD with a hyper-ellipsoid is better than with a hypersphere.
MDF is more efficient in training OC-SVM Notice that our approach is also more computationally efficient when obtaining the optimal w and R, since the optimization is performed on a new, transformed, low-dimensional data space (of dimension L = 12, the number of layers in f). See column #feats in Table 3 for detailed comparisons.

Table 4: Examples from CLINIC150 with predictions from three models; a prediction is "In" when a sample's anomaly score is lower than the 25th percentile and "Out" when it is larger than the 75th percentile. GT is the ground truth and Single stands for BERT-Single.

Fine-tuning techniques improve performance
From Table 3, we can see that both IMLM and BCAD improve OOD detection performance when incorporated with MDF separately. The overall best detection performance is achieved by IMLM + BCAD + MDF, combining both proposed fine-tuning methods with MDF. We also find that RoBERTa outperforms BERT when using MDF, even though single-layer features favor BERT in Table 2.

Visualizations
We plot the ROC curves of four different anomaly scores on SST in Figure 2 (a) and on CLINIC150 in Figure 2 (c), confirming that our proposed MDF and two fine-tuning techniques improve the ability to detect OOD samples. We also present the distributions of anomaly scores I(x) generated by our best method in Figure 2 (b) for SST and in Figure 2 (d) for CLINIC150. For SST, the OOD detector can clearly separate the I(x) of in-domain and out-of-domain samples, and the in-domain scores are densely concentrated in the low-score region. For CLINIC150, however, we do observe some OOD samples mixing with in-domain ones, which accounts for the gap in metric scores between the two datasets.

Case Studies
We present some examples from CLINIC150 together with their corresponding predictions by the TF-IDF, BERT-single layer, and MDF methods in Table 4. TF-IDF predicts false positives for examples (b) and (d) because most of the words in the example test query are seen in the training set, e.g., "i would like you to buy me some paper plates" (intent: order) and "i need to know how long to cook chicken for" (intent: cooking time). BERT-single layer learns the syntax of "can you tell me how to ...", which is frequently seen in the training data, but it fails to discern that the semantic meaning is out-of-domain. For example (d), all models make this mistake, potentially associating it with the intent recipe ("i need to find a good way to make chicken soup" or "what's the best way to make chicken stir fry").

Related Work
Out-of-domain detection is an essential component of trustworthy machine learning applications. There are two lines of work proposed to perform out-of-domain detection. One tackles the problem in specific multi-class classification tasks, where well-trained classifiers are utilized to design anomaly scores (e.g., Hendrycks and Gimpel, 2017; Liang et al., 2018; Lee et al., 2018; Card et al., 2019; Hendrycks et al., 2020; Xu et al., 2020). Those methods are only useful when multi-class labels are available, which limits their application in more general domains. Our proposed work goes beyond this limitation and can utilize large amounts of unsupervised data.
Another line of work is based on support estimation or density estimation, which assumes that in-domain data lies in a specific support or comes from a high-density region (Schölkopf et al., 2001; Tax and Duin, 2004). In principle, our work is closely related to this line. Besides, Zhai et al. (2016), Ruff et al. (2018), and Zong et al. (2018) also leverage the features of neural networks, though these methods require designing specific network structures for different data. Our work circumvents the issues of prior work by designing a computationally efficient method that leverages the powerful representations of pre-trained transformers.
Finally, the fine-tuning techniques we use to improve the representation of data are closely related to unsupervised pre-training for transformers (Devlin et al., 2019; Yang et al., 2019) and recently proposed contrastive learning (e.g., Chen et al., 2020). Lately, Gururangan et al. (2020) discovered that performing pre-training (MLM) on the target domain with unlabeled data can also help to improve downstream classification performance. To the best of our knowledge, our method is the first to incorporate transformers and pre-training techniques to improve out-of-domain detection.

Conclusion
We study the problem of detecting out-of-domain samples with unsupervised in-domain data, which is a more general setting for out-of-domain detection. We propose a simple yet effective method using Mahalanobis distance as features, which significantly improves the detection ability and reduces computational cost in learning the detector. Two domain-adaptive fine-tuning techniques are further explored to boost the detection performance.
In the future, we are interested in deploying our OOD detection method in real-world applications, such as detecting unseen new classes for incremental few-shot learning (Zhang et al., 2020; Xia et al., 2021) or filtering OOD samples in data augmentation.