Holistic Sentence Embeddings for Better Out-of-Distribution Detection

Detecting out-of-distribution (OOD) instances is crucial for the safe deployment of NLP models. Among recent textual OOD detection works based on pretrained language models (PLMs), distance-based methods have shown superior performance. However, they estimate sample distance scores in the last-layer CLS embedding space and thus do not make full use of the linguistic information encoded in PLMs. To address this issue, we propose to boost OOD detection by deriving more holistic sentence embeddings. Based on the observations that token averaging and layer combination contribute to improving OOD detection, we propose a simple embedding approach named Avg-Avg, which averages all token representations from each intermediate layer as the sentence embedding, and it significantly surpasses the state-of-the-art on a comprehensive suite of benchmarks by a 9.33% FAR95 margin. Furthermore, our analysis demonstrates that it indeed helps preserve general linguistic knowledge in fine-tuned PLMs and substantially benefits the detection of background shifts. This simple yet effective embedding method can be applied to fine-tuned PLMs with negligible extra cost, providing a free gain in OOD detection. Our code is available at https://github.com/lancopku/Avg-Avg.


Introduction
Pretrained language models have achieved remarkable performance on various NLP tasks under the assumption that the training and test samples are drawn from the same distribution (Wang et al., 2019). However, in real-life applications such as dialogue systems and clinical text processing, it is inevitable for models to make predictions on out-of-distribution (OOD) samples, which may result in fatally unreasonable predictions (Hendrycks et al., 2020). Therefore, it is crucial for fine-tuned PLMs to automatically detect OOD inputs.
Among recent works on textual OOD detection, distance-based methods have received much attention due to their superior performance (Podolskiy et al., 2021; Zhou et al., 2021). They calculate the sample distance to the training-data distribution as the uncertainty measure for OOD detection. In these approaches, the distance scores are usually calculated in the space of the last-layer CLS vectors (i.e., the inputs to the classification head) produced by fine-tuned PLMs. As is known, the CLS embedding space is optimized for the in-distribution classification task during fine-tuning and is thus not necessarily optimal for OOD detection.
In this paper, we investigate how to derive sentence embeddings suitable for OOD detection from fine-tuned PLMs. Motivated by the token averaging and layer combination techniques proposed to enhance unsupervised sentence embeddings (Su et al., 2021; Huang et al., 2021b), we apply them to OOD detection and make two intriguing empirical findings: (1) averaging all token representations outperforms the standard practice of only using the CLS vector; (2) combining token representations from all intermediate layers brings further improvements. These observations lead to an extremely simple yet effective pooling technique: averaging all token representations in each intermediate layer as the sentence embedding for OOD detection.
We name this all-layer-all-token pooling technique Avg-Avg and demonstrate that it consistently lifts the OOD detection performance of BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) models on a comprehensive suite of textual OOD detection benchmarks. Further investigation into the rationale behind the improvement shows that Avg-Avg effectively helps preserve general linguistic information in the feature space and benefits detecting background shifts. In summary, our proposal serves as a plug-and-play post-processing technique that improves the capability of fine-tuned PLMs to detect OOD instances, and it reveals that deriving more holistic representations is a promising direction for boosting textual OOD detection.
2 Avg-Avg: Holistic Sentence Embedding for Better OOD Detection

Preliminaries
Modern pretrained language models have been developed based on the Transformer architecture (Vaswani et al., 2017). Given an input sentence S with n tokens, an L-layer Transformer encoder produces a set of hidden states $H = \{H_0, H_1, \ldots, H_L\}$, where $H_i = (h_i^1, h_i^2, \ldots, h_i^n)$ are the embedding vectors for each token in S in the i-th Transformer layer and $H_0$ denotes the static token embeddings.

Methodology
In the pretraining-finetuning paradigm, the CLS token is usually placed at the beginning of S, and the corresponding vector produced by the last Transformer layer, $h_L^1$, is fed into the classification head for fine-tuning. In existing works, the CLS vector $h_L^1$ is regarded as the sentence representation, and OOD detection is conducted in the corresponding embedding space (Podolskiy et al., 2021; Zhou et al., 2021). Such a practice does not fully exploit the linguistic information contained in H. Consequently, we resort to two pooling strategies to derive more holistic sentence representations:
• Intra-Layer Token Averaging: For the i-th layer, we average the hidden vectors of all tokens as the pooled representation $P_i$, i.e., $P_i = \frac{1}{n} \sum_{j=1}^{n} h_i^j$, to replace the default $P_i = h_i^1$.
• Inter-Layer Combination: Given the intermediate pooled representations $P_1, P_2, \ldots, P_L$, we perform layer combination to obtain the final pooled sentence representation P for OOD detection: $P = \frac{1}{|M|} \sum_{i \in M} P_i$, where $M \subseteq \{1, 2, \ldots, L\}$ denotes the subset of intermediate layers chosen for combination.
In our embedding approach Avg-Avg, token averaging is performed for intra-layer pooling, and all layers are chosen for layer combination, in other words, M = {1, 2, . . ., L}. Table 1 shows the rationality of this choice: for a RoBERTa-based model fine-tuned on the SST-2 sentiment analysis dataset, Avg-Avg significantly outperforms other pooling strategies for detecting 20 Newsgroups (20NG) samples as OOD data, including the default last-layer CLS pooling and the first-last-avg pooling used for unsupervised sentence embedding (Su et al., 2021).
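Concretely, the Avg-Avg pooling step amounts to a few lines of PyTorch. The function below is our illustrative sketch (not the paper's released code); it assumes hidden states obtained from a HuggingFace-style encoder called with `output_hidden_states=True`, and the function and variable names are our own:

```python
import torch

def avg_avg_pooling(hidden_states, attention_mask):
    """Average all token vectors in every intermediate layer (intra-layer
    token averaging), then average the per-layer pooled vectors
    (inter-layer combination) to obtain one sentence embedding.

    hidden_states: tuple of (L+1) tensors, each (batch, seq_len, dim);
        hidden_states[0] holds the static token embeddings H_0.
    attention_mask: (batch, seq_len) tensor, 1 for real tokens, 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    n_tokens = mask.sum(dim=1).clamp(min=1.0)          # (batch, 1)
    pooled = []
    for layer in hidden_states[1:]:                    # layers 1..L, skipping H_0
        pooled.append((layer * mask).sum(dim=1) / n_tokens)  # P_i, mask-aware mean
    return torch.stack(pooled, dim=0).mean(dim=0)      # average over all L layers
```

With a fine-tuned RoBERTa-base model, `outputs = model(**batch, output_hidden_states=True)` yields 13 hidden-state tensors, and the function returns a (batch, 768) embedding matrix that can be fed directly into a Mahalanobis-distance detector.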

Experimental Setup
Benchmarks Following Zhou et al. (2021), we choose four datasets corresponding to three tasks as the in-distribution (ID) datasets: SST-2 (Socher et al., 2013) and IMDB (Maas et al., 2011) for sentiment analysis, TREC-10 (Li and Roth, 2002) for question classification, and 20 Newsgroups (Lang, 1995) for topic classification. Among the four, any pair of datasets coming from different tasks is regarded as an ID-OOD pair. Besides, we use four additional datasets as OOD test data for each ID dataset: WMT-16 (Bojar et al., 2016), Multi30k (Elliott et al., 2016), RTE (Dagan et al., 2005), and SNLI (Bowman et al., 2015). More details of these datasets can be found in Appendix A.1.

Model Configuration
We build text classifiers by fine-tuning the RoBERTa-base model (Liu et al., 2019) on the ID training data. We compare our method with common OOD detection baselines: MSP (Hendrycks and Gimpel, 2017), the energy score (Liu et al., 2020), LOF (Lin and Xu, 2019), the Mahalanobis distance (MD) (Lee et al., 2018; Podolskiy et al., 2021), and MD combined with contrastive targets (Zhou et al., 2021). See Appendix D for the introduction and implementation details of these baseline methods.

Overall Results
Table 2 gives the main results. Except for the contrastive-based tuning method (Zhou et al., 2021), all methods use the same model vanilla fine-tuned on the ID training set. Our methods use the Mahalanobis distance to obtain OOD scores, following Podolskiy et al. (2021) (the only difference lies in the embedding space). We find that, compared to the baseline calculating MD in the last-layer CLS embedding space, both token averaging and layer combination bring improvements on almost all benchmarks. When the two techniques are combined, i.e., when Avg-Avg is applied, the performance continues to grow and exceeds the previous state-of-the-art (Zhou et al., 2021), which needs extra contrastive targets in the fine-tuning stage, by a considerable margin of 9.33% FAR95 averaged over four benchmarks. Further experiments on other PLM backbones also substantiate the enhancement brought by our method, as shown by the results in Table 3.

Analysis
The Impact of Layer Choice To verify the rationality of choosing all intermediate layers for inter-layer combination, we show in Figure 1 the maximum AUROC values obtained when different numbers of intermediate layers are used to derive sentence embeddings. As the number of layers grows, the AUROC metric first increases and then remains relatively stable once more than four layers are chosen. Notably, the peak appears when 7 layers are combined, and it is only 0.3% higher than our Avg-Avg.
Since searching for the best combination of layers is infeasible due to the unavailability of OOD data, using all layers is a sensible choice.
Probing Analysis Given that intermediate layers of PLMs contain a rich hierarchy of linguistic information (Jawahar et al., 2019), a plausible explanation of the performance lift is that Avg-Avg leads to an embedding space containing more general linguistic information, where ID and OOD data are more sharply separated. To verify this, we evaluate the sentence embeddings produced by the RoBERTa model fine-tuned on SST-2, corresponding to different pooling strategies, on the probing tasks proposed by Conneau et al. (2018) (details in Appendix A.3). As shown in Table 4, our proposed method consistently raises the probing accuracies of surface-, syntactic-, and semantic-level probing tasks, suggesting that we obtain more holistic embeddings by integrating intermediate hidden states.

Detecting Different Kinds of Shifts OOD texts can be categorized by whether they exhibit a background shift or a semantic shift (Arora et al., 2021).
In the previous main experiments, ID and OOD data come from different tasks and both kinds of shifts exist. To explore the source of the performance growth, we conduct ablation experiments evaluating our method in settings where background or semantic shifts dominate. For the semantic shift setting, we use the News Category (Misra, 2018) and CLINC (Larson et al., 2019a) datasets (the ID and OOD parts share the same background distribution but belong to different classes); for the background shift setting, we regard SST-2 as ID and IMDB and Customer Reviews (CR for short) (Hu and Liu, 2004) as OOD (they all belong to the sentiment analysis task but differ in background features, e.g., length and style). Refer to Appendix A.2 for dataset details. As shown in Table 5, our method drastically strengthens the capability of detecting background shifts; in contrast, it only slightly improves detecting semantic shifts, which indicates that the performance gain mainly comes from the task-agnostic general linguistic information in the holistic embeddings obtained by our pooling technique, in line with the probing analysis.

Comparison with Universal Sentence Embedding Approaches
Here we further show the advantage of our method Avg-Avg over two representative universal sentence embedding approaches, SentenceBERT (SBERT) (Reimers and Gurevych, 2019) and SimCSE (Gao et al., 2021), on OOD detection. For SBERT, we test the model trained on NLI (natural language inference) data (last-layer mean pooling is adopted, as recommended in the original work); for SimCSE, we test the unsupervised model and the supervised model trained on NLI data (last-layer CLS pooling is adopted). The backbone model is RoBERTa-base in all methods. For a thorough comparison, we also fine-tune the pre-trained models on the ID data and obtain embeddings from the fine-tuned models using both the default pooling and our Avg-Avg. As shown in Table 6, Avg-Avg brings consistent improvements and beats both pre-trained and fine-tuned sentence embedding models using the default pooling. These results corroborate the advantage of Avg-Avg as a specialized embedding method for OOD detection.

Embedding Visualization
To demonstrate the influence of the studied pooling strategies on the embedding space, we fine-tune the RoBERTa-base model on SST-2 and use t-SNE (Van der Maaten and Hinton, 2008) to visualize instance embeddings, under different pooling strategies, from the SST-2 test set (ID) and an OOD test set (20 Newsgroups). As plotted in Figure 2, ID and OOD samples are more sharply separated in the representation space produced by Avg-Avg (Figure 2(b)), where there is almost no overlap between ID and OOD instances, than in the space of the default last-layer CLS embeddings (Figure 2(a)). This further supports our claim that Avg-Avg is better suited for OOD detection.
4 Related Work

Unsupervised Sentence Embedding
Unsupervised sentence embedding is a well-established area (Kiros et al., 2015; Pagliardini et al., 2017; Li et al., 2020; Reimers and Gurevych, 2019; Gao et al., 2021). Relevant to our work, Su et al. (2021) and Huang et al. (2021b) proposed to obtain better sentence embeddings via averaging token representations, layer combination, and a whitening operation. It is noteworthy that these embedding approaches are mainly studied for sentence matching and retrieval tasks. As far as we know, we are the first to study novel embedding ways to replace the default last-layer CLS pooling for boosting textual OOD detection.

Conclusion
In this work, we focus on how to derive sentence embeddings suitable for OOD detection from fine-tuned PLMs. Specifically, we introduce token averaging and layer combination to derive more holistic representations and substantially improve the capability of PLMs to detect OOD inputs. Moreover, our analysis shows that our approach helps preserve general linguistic information and benefits detecting background shifts. Overall, our work points out a new perspective, that textual OOD detection can be enhanced by obtaining high-quality sentence embeddings, and we hope to extend this idea to training-time methods in future work.

Limitations
Our current solution, Avg-Avg, is primarily motivated by empirical observations, and its effectiveness is confirmed by extensive experiments on different PLMs and benchmarks. However, its superiority currently lacks a strict theoretical justification, and there is still a small performance gap between our method and the ideal upper bound, as shown in Figure 1. In future work, we plan to explore theory-guaranteed embedding approaches to further boost the OOD detection ability of PLMs.

Ethical Considerations
Our work presents an efficient embedding method to enhance the OOD detection ability of NLP models. We believe that our proposal will help reduce security risks resulting from OOD inputs to NLP models deployed in open-world environments. In addition, all experiments in this work are conducted on open datasets and our code is publicly available. While we do not expect any direct negative consequences of this work, we hope to continue exploring more efficient and robust sentence embedding approaches for textual OOD detection in future work.

A.1 Datasets Used in Main Experiments
The statistics of the in-distribution (ID) and out-of-distribution (OOD) textual datasets used in the main experiments (Sections 3.1 and 3.2), including the number of classes, the dataset size, and the average length of samples, are given in Tables 7 and 8, respectively. Here is a brief introduction to these datasets: Multi30k (Elliott et al., 2016) and WMT16 (Bojar et al., 2016) are parts of the English-side data of English-German machine translation datasets; RTE (Dagan et al., 2005) and SNLI (Bowman et al., 2015) are the concatenations of the premises and respective hypotheses from NLI datasets.

A.2 Datasets Used In the Distribution Shift Analysis
Arora et al. (2021) categorized the distribution shifts in natural language data into two main types: background shifts and semantic shifts. We follow their division and study OOD detection performance under the setting where either kind of shift dominates in Section 3.3. The statistics of the extra datasets used in the distribution shift analysis are given in Table 9. Here is a brief introduction to these datasets.
Background Shift Setting. Background shifts refer to shifts of background features (e.g., formality) that do not depend on the label. We consider domain shifts in sentiment classification datasets. SST-2 contains short movie reviews written by the audience, while IMDB contains longer and more professional movie reviews. Customer Reviews (Hu and Liu, 2004) contains reviews of various commercial products on the web, representing a domain shift from SST-2. The IMDB and Customer Reviews test data can therefore be regarded as OOD samples for the model fine-tuned on SST-2.
Semantic Shift Setting. In this setting, OOD data are from the same task as ID data and share similar background characteristics, but belong to classes unseen during training. We use the News Category (Misra, 2018) and the CLINC (Larson et al., 2019b) datasets to create two ID/OOD pairs under this setting. Following Arora et al. (2021), we use the data from the five most frequent classes of News Category as ID (News Top-5) and the data from the remaining 36 classes as OOD (News Rest). In the CLINC dataset for intent classification, there is a 150-class ID subset and an OOD test set, CLINC OOD, composed of utterances belonging to actions not supported by the existing ID intents.

A.3 Probing Benchmarks
To probe the linguistic information contained in sentence embeddings, we use the probing tasks proposed by Conneau et al. (2018), which are grouped into three categories. For surface information, we use SentLen (sentence length) and WC (the presence of words); for syntactic information, we use BShift (sensitivity to word order), TreeDepth (the depth of the syntactic tree), and TopConst (the sequence of top-level constituents); for semantic information, we use Tense, SubjNum and ObjNum (the subject/direct object number in the main clause), SOMO (sensitivity to random replacement of a noun/verb), and CoordInv (sensitivity to random swapping of coordinated clausal conjuncts). Each probing dataset contains 100k training samples, 10k validation samples, and 10k test samples. We use the SentEval toolkit (Conneau and Kiela, 2018) along with the recommended hyperparameter space to search for the best probing classifier according to validation accuracy, and report test accuracies.

B Details of Pretrained Language Model Fine-tuning

B.1 Vanilla Fine-tuning
We use the RoBERTa-base pretrained model (Liu et al., 2019) as the backbone to build text classifiers by fine-tuning it on the ID training data. We use a batch size of 16 and fine-tune the model for 5 epochs. The model is optimized with the Adam optimizer (Kingma and Ba, 2015) using a learning rate of 2e-5. We evaluate the model on the ID development set after every epoch and choose the best checkpoint as the final model. The setting is the same for the other pretrained Transformers studied in the paper (RoBERTa-large, BERT-base-uncased, DistilRoBERTa-base, and ALBERT-base). DistilRoBERTa (Sanh et al., 2019) is a light distilled RoBERTa, and ALBERT (Lan et al., 2019) is a parameter-efficient variant of BERT.

B.2 Fine-tuning with Contrastive Auxiliary Targets

Given a batch of M training examples $\{(x_i, y_i)\}_{i=1}^{M}$, where $x_i$ is the input and $y_i$ is the label, the supervised contrastive loss term $\mathcal{L}_{scl}$ and the final optimization target $\mathcal{L}$ can be formulated as:

$$\mathcal{L}_{scl} = \sum_{i=1}^{M} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}, \qquad \mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{scl}, \tag{1}$$

where $A(i) = \{1, \ldots, M\} \setminus \{i\}$ is the set of all anchor instances, $P(i) = \{p \in A(i) : y_i = y_p\}$ is the set of anchor instances from the same class as i, τ is a temperature hyper-parameter, z is the L2-normalized CLS embedding before the softmax layer, $\mathcal{L}_{ce}$ is the cross-entropy loss, and λ is a positive coefficient. Following Zhou et al. (2021), we use τ = 0.3 and λ = 2.
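The supervised contrastive term can be sketched in PyTorch as follows. This is our illustrative implementation of the standard formulation, with function and variable names of our own choosing; it expects already L2-normalized CLS embeddings:

```python
import torch

def supervised_contrastive_loss(z, labels, tau=0.3):
    """Supervised contrastive loss over L2-normalized embeddings.

    z: (M, d) tensor of L2-normalized CLS embeddings.
    labels: (M,) tensor of class ids.
    Anchors with no same-class partner in the batch are skipped.
    """
    M = z.size(0)
    sim = z @ z.t() / tau                                   # pairwise similarities
    self_mask = torch.eye(M, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))         # A(i) excludes i itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)         # avoid -inf * 0 = nan below
    pos_mask = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask).float()
    n_pos = pos_mask.sum(dim=1)                             # |P(i)|
    valid = n_pos > 0
    loss_i = -(log_prob * pos_mask).sum(dim=1)[valid] / n_pos[valid]
    return loss_i.sum()
```

The final training objective would then be the cross-entropy loss plus `lam * supervised_contrastive_loss(z, labels)` with λ = 2 and τ = 0.3 as in the setting above.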
The margin-based loss term $\mathcal{L}_{margin}$ and the final optimization target $\mathcal{L}$ can be formulated as:

$$\mathcal{L}_{margin} = \frac{1}{d} \sum_{i=1}^{M} \Big( \frac{1}{|P(i)|} \sum_{p \in P(i)} \lVert h_i - h_p \rVert_2^2 + \frac{1}{|N(i)|} \sum_{n \in N(i)} \max\big(0, \xi - \lVert h_i - h_n \rVert_2^2\big) \Big), \qquad \mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{margin}. \tag{2}$$

Here $N(i) = \{n \in A(i) : y_i \neq y_n\}$ is the set of anchor instances from classes other than $y_i$, $h \in \mathbb{R}^d$ is the unnormalized CLS embedding before the softmax layer, ξ is the margin, d is the dimensionality of h, and λ is a positive coefficient. We use λ = 2 following Zhou et al. (2021).
Except for the loss term, we use the same hyperparameters for these two tuning methods as for vanilla tuning. Table 10 gives test accuracies on the four ID datasets for the RoBERTa models tuned with the vanilla cross-entropy loss ($\mathcal{L}_{ce}$), the supervised contrastive loss ($\mathcal{L}_{ce} + \mathcal{L}_{scl}$), and the margin-based contrastive loss ($\mathcal{L}_{ce} + \mathcal{L}_{margin}$), among which there are no significant differences.

B.3 Hardware Requirements
All the experiments (fine-tuning and inference) in this paper are conducted on a single NVIDIA TITAN RTX GPU, except that fine-tuning the RoBERTa-large model requires 4 TITAN RTX GPUs.

C Definition of Evaluation Metrics for OOD Detection
For an input instance x, the output of an OOD detector is a confidence score S(x). A higher confidence score indicates that the detector tends to regard x as a normal ID sample. In real applications, system users need to choose a threshold γ and treat the OOD detection module as a binary classifier:

$$G_\gamma(x) = \begin{cases} \text{ID}, & S(x) \ge \gamma \\ \text{OOD}, & S(x) < \gamma. \end{cases}$$

Following previous works (Hendrycks and Gimpel, 2017; Zhou et al., 2021), we use the following two threshold-free metrics for evaluation:

AUROC is short for the area under the receiver operating characteristic curve, which plots the true positive rate (TPR) against the false positive rate (FPR). It can be interpreted as the probability that the model ranks a random positive (ID) example more highly than a random negative (OOD) example. A higher AUROC indicates better OOD detection performance.
FAR95 is the probability for a negative (OOD) example to be mistakenly classified as positive (ID) when the TPR is 95%. A lower value indicates better detection performance.
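Both metrics are straightforward to compute from raw confidence scores. The following NumPy sketch is our own illustration, using the pairwise (rank) definition of AUROC and the 5th ID-score percentile as the TPR-95% threshold:

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """AUROC = P(S(id) > S(ood)), with ties counted as 1/2 (pairwise definition)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float((id_s > ood_s).mean() + 0.5 * (id_s == ood_s).mean())

def far95(id_scores, ood_scores):
    """Fraction of OOD examples scored at or above the threshold keeping TPR at 95%."""
    thresh = np.percentile(id_scores, 5)  # 95% of ID scores lie above this value
    return float(np.mean(np.asarray(ood_scores) >= thresh))
```

For large score arrays, the O(n·m) pairwise AUROC above would normally be replaced by a rank-based routine such as `sklearn.metrics.roc_auc_score`; the pairwise form is used here only because it mirrors the probabilistic definition in the text.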

D.3 LOF
Lin and Xu (2019) proposed identifying unknown user intents by feeding feature vectors to a density-based novelty detection algorithm, the local outlier factor (LOF) (Breunig et al., 2000). We use the last-layer CLS vectors produced by the fine-tuned RoBERTa models as the input, train a LOF model on the ID training set following the implementation details of Lin and Xu (2019), and use the local density output as S(x).

E Comparison with Score Ensemble
Apart from the layer combination technique studied in this paper, there is another way to utilize intermediate representations for OOD detection: estimating the sample distance score in the embedding space of each intermediate layer and taking their (weighted) sum as the final OOD score. For the Mahalanobis distance score, the final ensemble score S(x) is defined as:

$$S(x) = \sum_{\ell} \alpha_\ell S_\ell(x), \qquad S_\ell(x) = \max_{c} -\big(\psi_\ell(x) - \mu_{\ell,c}\big)^{T} \Sigma_\ell^{-1} \big(\psi_\ell(x) - \mu_{\ell,c}\big),$$

where $\psi_\ell(x)$ denotes the output features at the ℓ-th layer of the neural network, $\mu_{\ell,c}$ and $\Sigma_\ell$ are the class mean and the covariance matrix estimated at that layer, and $\alpha_\ell$ is the layer-wise weighting hyperparameter. In the original work (Lee et al., 2018), $\alpha_\ell$ is tuned on a small validation set containing both ID and OOD samples for each OOD dataset, which is impractical in the unsupervised OOD detection setting followed by recent works (OOD data is not available). Following Hsu et al. (2020), we use uniform weighting, i.e., $S(x) = \sum_\ell S_\ell(x)$, in the baselines for comparison.
We compare the performance of SE (score ensemble) and Avg-Avg and show the results in Table 11. We observe that SE also brings consistent improvements over the baseline using only last-layer CLS vectors. Without token averaging, SE slightly surpasses Avg-Avg on IMDB and TREC-10, but underperforms Avg-Avg significantly on SST-2 and 20NG; when token averaging is performed, SE only beats Avg-Avg on 20NG but underperforms on the other three benchmarks, most remarkably on SST-2. In view of the average performance on the four benchmarks, we get Avg-Avg > SE (AVG) > SE (CLS). Considering that the class means $\mu_\ell$ and the inverse covariance matrices $\Sigma_\ell^{-1}$ need to be estimated and stored for each layer in SE, Avg-Avg is also more convenient for deployment. Compared with SE, Avg-Avg thus enjoys both simplicity and performance advantages.

Figure 1 :
Figure 1: Maximum AUROC values (averaged over six OOD datasets) for sentence embeddings from a RoBERTa model fine-tuned on SST-2 with different numbers of combined layers. The maximum values are obtained by searching on the test data. Token averaging is performed for intra-layer pooling.

Figure 2 :
Figure 2: Visualization of the representations obtained for positive and negative instances in SST-2 and OOD ones (20 Newsgroups).

D.1 MSP

MSP (Hendrycks and Gimpel, 2017) is a classical baseline using the maximum softmax probability in the prediction outputs of the classifier as the confidence score, i.e., $S(x) = \max_{y \in \Upsilon} p_y(x)$.

D.2 Energy Score

Liu et al. (2020) proposed using the free energy as a scoring function for OOD detection. For a classification problem with C classes, a multi-class classifier $f(x): \mathcal{X} \to \mathbb{R}^C$ can be interpreted from an energy-based perspective by viewing the logit output $f_{y_i}(x)$ corresponding to class $y_i$ as an energy function $E(x, y_i)$. The free energy of x is then $E(x) = -\log \sum_{i=1}^{C} \exp(f_{y_i}(x))$, and the confidence score is $S(x) = -E(x)$.
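Both scores can be computed directly from classifier logits. A small NumPy sketch (ours) of the two scoring functions, where the energy score is the negative free energy, i.e., the log-sum-exp of the logits:

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: S(x) = max_y p_y(x)."""
    z = logits - logits.max(axis=-1, keepdims=True)        # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

def energy_score(logits):
    """Negative free energy: S(x) = -E(x) = log sum_y exp(f_y(x))."""
    m = logits.max(axis=-1)
    return m + np.log(np.exp(logits - m[..., None]).sum(axis=-1))
```

Here `logits` is the (batch, C) output of the fine-tuned classifier; higher scores under either function indicate more ID-like inputs.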

D.4 Mahalanobis Distance

The Mahalanobis distance score (Lee et al., 2018) is a representative distance-based OOD detection algorithm, which uses the sample distance to the nearest ID class in the embedding space as the OOD uncertainty measure. For a given feature extractor ψ, the Mahalanobis distance score is defined as:

$$S(x) = \max_{c \in \Upsilon} -\big(\psi(x) - \mu_c\big)^{T} \Sigma^{-1} \big(\psi(x) - \mu_c\big), \tag{3}$$

where ψ(x) is the embedding vector of the input x, $\mu_c$ is the class centroid for a class c, and Σ is the covariance matrix. $\mu_c$ and Σ are estimated as follows:

$$\mu_c = \frac{1}{N_c} \sum_{x \in \mathcal{D}_{in}^{c}} \psi(x), \qquad \Sigma = \frac{1}{N} \sum_{c \in \Upsilon} \sum_{x \in \mathcal{D}_{in}^{c}} \big(\psi(x) - \mu_c\big)\big(\psi(x) - \mu_c\big)^{T}, \tag{4}$$

where $\mathcal{D}_{in}^{c} = \{x \mid (x, y) \in \mathcal{D}_{in}, y = c\}$ denotes the training samples belonging to class c, N is the size of the training set, and $N_c$ is the number of training instances belonging to class c.
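The class-mean and shared-covariance estimation and the resulting score can be sketched in a few lines of NumPy. This is our own illustration (function names and the use of a pseudo-inverse for numerical safety are our choices, not part of the cited method):

```python
import numpy as np

def fit_gaussian(features, labels):
    """Estimate per-class means and a single covariance matrix shared across classes."""
    classes = np.unique(labels)
    mus = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - mus[c] for c in classes])
    sigma = centered.T @ centered / len(features)   # averaged over the full training set
    return mus, sigma

def mahalanobis_score(x, mus, sigma):
    """S(x) = max_c -(x - mu_c)^T Sigma^{-1} (x - mu_c); higher means more ID-like."""
    prec = np.linalg.pinv(sigma)                    # pseudo-inverse for numerical safety
    dists = [(x - mu) @ prec @ (x - mu) for mu in mus.values()]
    return -min(dists)                              # negative distance to nearest class
```

In our setting, `features` would be the Avg-Avg (or CLS) embeddings of the ID training set and `x` the embedding of a test input.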

Table 1 :
OOD detection performance of different pooling strategies on the SST-2 vs. 20NG benchmark, with the Mahalanobis distance (Lee et al., 2018) as the OOD detection method. Avg denotes token average pooling and L12 denotes the 12th (last) layer of the RoBERTa model. These results are exploratory, and the superiority of Avg-Avg is further confirmed by the following experiments.

Table 2 :
Results of previous OOD detection methods and ours on four benchmarks. ↑ indicates larger is better and ↓ indicates lower is better. For each ID dataset, we report the macro average of AUROC / FAR95 values over all corresponding OOD datasets. All values are percentages averaged over five runs with different random seeds, and the best results are highlighted in bold. $\mathcal{L}_{scl}$ and $\mathcal{L}_{margin}$ denote the contrastive and margin-based auxiliary targets proposed by Zhou et al. (2021), respectively.

Table 3 :
The improvements brought by Avg-Avg compared to the MD baseline (Podolskiy et al., 2021) for different PLMs. AUROC values are reported (the number in brackets is the improvement).

Table 4 :
Probing task performance for representations corresponding to different pooling strategies. All values are percentages averaged over five RoBERTa models fine-tuned with different random seeds.

Table 5 :
Performance (AUROC) on different kinds of distribution shifts, corresponding to the MD baseline and our proposed Avg-Avg. All values are percentages averaged over five different random seeds.


Table 6 :
The OOD detection performance (FAR95) of different embedding approaches (lower FAR95 values indicate better detection performance). The "ft" subscript denotes that the embedding model is fine-tuned on the in-distribution data for classification.

Table 7 :
Statistics of in-distribution text datasets. L denotes the average length of samples.

Table 8 :
Statistics of out-of-distribution text datasets.L denotes the average length of samples.

Table 9 :
Statistics of extra datasets introduced for the distribution shift analysis.L denotes the average length of each sample.

Table 11 :
Comparison between score ensemble (SE) and Avg-Avg. The setting of backbone models and ID/OOD benchmarks is the same as that in Table 2.