Enhancing Out-of-Distribution Detection in Natural Language Understanding via Implicit Layer Ensemble

Out-of-distribution (OOD) detection aims to discern outliers from the intended data distribution, which is crucial for maintaining high reliability and a good user experience. Most recent studies in OOD detection utilize the information from a single representation that resides in the penultimate layer to determine whether the input is anomalous or not. Although such a method is straightforward, the potential of diverse information in the intermediate layers is overlooked. In this paper, we propose a novel framework based on contrastive learning that encourages intermediate features to learn layer-specialized representations and assembles them implicitly into a single representation to absorb the rich information in the pre-trained language model. Extensive experiments on various intent classification and OOD datasets demonstrate that our approach is significantly more effective than other works.


Introduction
Natural language understanding (NLU) in dialog systems, which is often formalized as a classification task to identify intentions behind user input, is a vital component, as its decisions propagate to the downstream pipelines. Numerous works have achieved immense success on sundry tasks (e.g., intent classification, NLI, QA), reaching parity with human performance (Wang et al., 2019). Despite their success on many different benchmarks, neural models are known to be vulnerable to test inputs from an unknown distribution (Hendrycks and Gimpel, 2017; Hein et al., 2019), commonly referred to as outliers, since they depend strongly on the closed-world assumption (i.e., the I.I.D. assumption). Thus, out-of-distribution (OOD) detection (Aggarwal, 2017), which aims to discern outliers from the train distribution, is an essential research problem for ensuring a high-quality user experience and maintaining strong reliability, as systems in the wild ceaselessly encounter myriad unseen data.

Figure 1: Layer-wise performances and their explicit ensemble (Shen et al., 2021) performance on BERT-base. The explicit ensemble often leads to worse AUROC (higher is better) than using a single well-performing layer. Detailed explanations of the setting and baseline model are given in Sec. 4.2.1 and Sec. 4.3, respectively.
The most prevailing paradigm in OOD detection is extract and score. Namely, it extracts the representation of the input from a neural model and passes it to a pre-defined scoring function. Then, the scoring function gauges the appropriateness of the input based on the extracted feature and decides whether the input is from the normal distribution. The most common rule of thumb for extracting representations from neural models is employing the last layer, a simple and intuitive way to obtain a holistic representation, which is universally utilized in broad machine learning areas.
Meanwhile, previous studies (Tenney et al., 2019; Clark et al., 2019) revealed that the middle layers of a language model also conceal copious information. For instance, prior studies on language model probing suggest that syntactic linguistic knowledge is most prominent in the middle layers (Hewitt and Manning, 2019; Goldberg, 2019; Jawahar et al., 2019), and semantic knowledge in BERT is spread widely across all layers (Tenney et al., 2019). In this regard, leveraging intermediate layers can lead to better OOD detection performance, as they retain information complementary to the last-layer feature, which might be beneficial for discriminating outliers. Several studies (Shen et al., 2021; Sastry and Oore, 2020; Lee et al., 2018b) have shown empirical evidence that intermediate representations are indeed beneficial in detecting outliers. Precisely, they attempted to utilize the middle layers by naïvely aggregating the individual result of every single intermediate feature explicitly.
Although previous studies have shown the potential of intermediate layer representations in OOD detection, we confirmed that the aforementioned naïve ensemble scheme spawns several problems (Fig. 1 illustrates the OOD performance of the layer-wise features and their explicit ensemble on two different datasets). The first problem we observed is that neither the ensemble result (red bar) nor the last layer can guarantee the best performance among the entire layers, depending on the setting. Such a phenomenon raises the necessity for a more elaborate approach that derives a more meaningful ensemble representation from the various representations, rather than the current simple summation or selection of a single layer. Secondly, even when the explicit ensemble gives a sound performance, it requires multiple computations of the scoring function by construction. Thus, the explicit ensemble inevitably delays the detection time, which is a critical shortcoming in OOD detection, as swift and precise decision-making is the cornerstone of this area.
To remedy the limitations of the explicit ensemble schemes, we propose a novel framework dubbed Layer-agnostic Contrastive Learning (LaCL). Our framework is inspired by the foundation of an ensemble, which seeks a more calibrated output by combining heterogeneous decisions from multiple models (Kuncheva and Whitaker, 2003; Gashler et al., 2008). Specifically, LaCL regards intermediate layers as independent decision-makers and assembles them into a single vector to yield a more accurate prediction: LaCL makes middle-layer representations richer and more diverse by injecting the advantage of contrastive learning (CL) into intermediate layers, while suppressing inter-layer representations from being similar through an additional regularization loss. Then, LaCL assembles them into a single ensemble representation implicitly to circumvent multiple computations of the scoring function.
We demonstrate the effectiveness of our approach in 9 different OOD scenarios, where LaCL consistently surpasses other competitive works and their explicit ensemble performance by a significant margin. Moreover, we conducted an in-depth analysis of LaCL to elucidate its behavior in conjunction with our intuition.

Related Work
OOD detection. Methodologies in OOD detection can be divided into supervised (Hendrycks et al., 2019; Lee et al., 2018a; Dhamija et al., 2018) and unsupervised settings according to the presence of training data from OOD. Since the scope of OOD covers a nigh-infinite space, gathering data over the whole OOD space is infeasible. For this realistic reason, most recent OOD detection studies, including this work, discriminate OOD input in an unsupervised manner. Numerous branches of machine learning tactics are employed for unsupervised OOD detection: generating pseudo-OOD data (Chen and Yu, 2021; Zheng et al., 2020), Bayesian methods (Malinin and Gales, 2018), self-supervised learning based approaches (Moon et al., 2021; Manolache et al., 2021; Li et al., 2021; Zhou et al., 2021; Zeng et al., 2021; Zhan et al., 2021), and novel scoring functions that measure the uncertainty of the given input (Hendrycks and Gimpel, 2017; Lee et al., 2018b; Liu et al., 2020; Tack et al., 2020).
Contrastive learning & OOD detection. Among the numerous approaches mentioned, contrastive learning (CL) based methods (Chen et al., 2020; Zbontar et al., 2021; Grill et al., 2020) have recently spurred predominant interest in OOD detection research. The superiority of CL in OOD detection comes from the fact that it can guide a neural model to learn semantic similarity within data instances. Such a property is also precious for unsupervised OOD detection, as there is no accessible clue regarding outliers or the abnormal distribution. Despite its potential, CL was mainly utilized in the computer vision field (Cho et al., 2021; Sehwag et al., 2021; Tack et al., 2020; Winkens et al., 2020) in early works due to its high reliance on data augmentation. However, it is now also widely used in various NLP applications with the help of recent progress (Li et al., 2021; Liu et al., 2021; Kim et al., 2021; Carlsson et al., 2020; Gao et al., 2021; Sennrich et al., 2016). Specifically, Li et al. (2021) verified that CL is also helpful in the NLP field, and Zhou et al. (2021) and Zeng et al. (2021) redesigned the contrastive-learning objective into a form more appropriate for OOD detection.
Potential of intermediate representations. The leading driver of the recent upheaval in NLP is the pre-trained language model (PLM), such as BERT (Devlin et al., 2019) and GPT (Radford et al., 2018), which is trained on a large-scale dataset with a transformer-based architecture (Vaswani et al., 2017). Numerous studies have attempted to reveal the role and characteristics of each layer in PLMs and verified that diverse information is concealed in the middle layers, which is now a pervasive notion in the machine learning community. For instance, Tenney et al. (2019) showed that different layers of the BERT network can resolve the syntactic and semantic structure within a sentence. Clark et al. (2019) proposed an attention-based probing classifier leveraging the syntactic information in the middle layers of BERT. Several studies (Shen et al., 2021; Sastry and Oore, 2020; Lee et al., 2018b) have shown the potential of intermediate representations in OOD detection by explicitly aggregating the individual result of every single intermediate feature.

Intuition
The prime objective of our framework is to assemble the rich information in the entire layers into a single ensemble representation to derive a more reliable decision. Inspired by the foundation of ensemble learning, which seeks better predictive performance by combining the predictions from multiple models, we regard each intermediate layer as an independent model (or decision-maker). To make each layer a better decision-maker, LaCL injects a sound representation-learning signal (i.e., supervised contrastive learning) into the entire layers by training the objective function in a layer-agnostic manner to engage every layer more directly. Additionally, we propose the correlation regularization (CR) loss, which decorrelates pairs of strongly correlated adjacent representations to encourage each layer to learn layer-specialized representations from its complementary information. Then, the global compression layer (GCL) implicitly assembles the various features in each layer into a single calibrated ensemble representation. In the following subsections, we explain the components of our model in detail.

Supervised Contrastive Learning
Supervised contrastive learning (SCL) is a supervised variant of vanilla contrastive learning, which employs the label information of the input to group samples of known classes more tightly. Thus, SCL can learn data-label relationships as well as data-data relationships as in CL. Over the augmented batch B̄, the SCL loss takes the form

L_SCL = Σ_{i ∈ B̄} (−1 / |P(i)|) Σ_{p ∈ P(i)} log [ exp(z_i · z_p / τ) / Σ_{a ∈ B̄, a ≠ i} exp(z_i · z_a / τ) ],

where P(i) = {p ∈ B̄ \ {i} : ȳ_p = ȳ_i} is the set of indices of all positives in the augmented batch with query index i, and τ represents the temperature hyperparameter.
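As a concrete illustration, the SCL objective can be computed as in the following minimal NumPy sketch. This is our own simplification, not the authors' implementation; it loops over queries for clarity rather than vectorizing.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.05):
    """Supervised contrastive loss over an augmented batch.

    z: (N, D) representations (N = 2 * batch size after augmentation).
    labels: (N,) class ids; augmented views keep their original label.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalize
    sim = z @ z.T / tau                               # pairwise similarities
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        mask = np.arange(n) != i                      # exclude the query itself
        denom = np.log(np.sum(np.exp(sim[i][mask])))  # log of the denominator
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # average of -log(exp(sim_ip) / sum_a exp(sim_ia)) over positives P(i)
        loss += -np.mean([sim[i][p] - denom for p in positives])
        count += 1
    return loss / count
```

With temperature τ = 0.05 as in the paper, tightly clustered same-class pairs drive the loss toward zero, while positives that are far apart are penalized heavily.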

Global Compression Layer
The global compression layer (GCL) is a two-layer MLP that is directly connected to the entire layers to assemble the intermediate representations into a single representation z. GCL can be viewed as a particular type of projection head in contrastive learning. By linking the projection head to the entire layers, GCL facilitates layer-agnostic training that directly engages every middle layer in the training objective.
The process of extracting the final latent vector z with GCL is as follows (the batch index term b is omitted for brevity from now on): First, each layer l (l ∈ {1, ..., |L|}, where |L| refers to the number of layers) in the PLM outputs token embeddings H_l = [h_{l,1}, ..., h_{l,T}] for sentence x. Then we combine the token embeddings H_l into a single vector h_l = pool(H_l) by applying the pooling function (i.e., mean pooling). Lastly, GCL receives the pooled token embedding of each layer h_l (where h_l ∈ R^{|D|}) as input and outputs a compact low-dimensional representation c_l (where c_l ∈ R^{|D|/|L|}). We then concatenate all compact representations c_l to generate a single sentence representation z for x:

z = c_1 ⊕ c_2 ⊕ ... ⊕ c_{|L|},

where ⊕ indicates concatenation and z ∈ R^{|D|}. LaCL trains the SCL loss with the final representation z from GCL, which inheres information from the entire layers.
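The extraction pipeline above (pool each layer, compress it with the two-layer MLP, concatenate) can be sketched as follows. The weight shapes follow the dimensions reported in Sec. 4.1, but the random weights and the assumption of a single MLP shared across layers are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, D = 12, 16, 768          # layers, tokens, hidden dim (BERT-base)
d_mid, d_out = 1024, D // L    # GCL hidden dim and per-layer output dim (64)

# Hypothetical GCL weights: a two-layer MLP applied to each pooled layer.
W1 = rng.normal(0, 0.02, (D, d_mid)); b1 = np.zeros(d_mid)
W2 = rng.normal(0, 0.02, (d_mid, d_out)); b2 = np.zeros(d_out)

def gcl(hidden_states):
    """hidden_states: list of |L| token-embedding matrices H_l, each (T, D)."""
    chunks = []
    for H in hidden_states:
        h = H.mean(axis=0)                          # mean pooling over tokens
        c = np.maximum(h @ W1 + b1, 0) @ W2 + b2    # two-layer MLP -> R^{D/L}
        chunks.append(c)
    return np.concatenate(chunks)                   # z = c_1 (+) ... (+) c_L

z = gcl([rng.normal(size=(T, D)) for _ in range(L)])
```

Each layer contributes a 64-dimensional chunk, so the concatenated z matches the 768-dimensional BERT-base embedding size.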

Correlation Regularization Loss
The correlation regularization (CR) loss restrains pairs of features from adjacent layers from being similar, following the intuition of an ensemble, where the performance boost springs from diverse decisions (Kuncheva and Whitaker, 2003; Gashler et al., 2008). Specifically, it encourages adjacent layers to activate different dimensions given the same input. First, we define the correlation in dimension d of the adjacent layers (l and l + 1) as follows:

C_{l,d} = ( Σ_b c^l_{b,d} c^{l+1}_{b,d} ) / ( sqrt(Σ_b (c^l_{b,d})^2) · sqrt(Σ_b (c^{l+1}_{b,d})^2) ).

Then, the CR loss selects the strongly correlated dimension set S_l by picking the dimensions whose correlation exceeds the pre-set margin value m, i.e., S_l = {d : C_{l,d} > m}, and decorrelates the set S_l iterating over every adjacent layer pair:

L_CR = Σ_{l=1}^{|L|−1} Σ_{d ∈ S_l} C_{l,d}.

Finally, the overall loss term for LaCL can be described as follows:

L = L_SCL + λ_1 L_CR,

where λ_1 denotes the weight for the CR loss.
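A minimal sketch of the CR loss under these definitions (our own reconstruction; whether the features are mean-centered before computing the per-dimension correlation is an assumption we make here):

```python
import numpy as np

def cr_loss(C_layers, margin=0.5):
    """Correlation regularization over adjacent layers.

    C_layers: list of |L| arrays, each (B, D/L) - compressed features c_l
    for a batch of B inputs. Returns the sum of strong correlations.
    """
    total = 0.0
    for Cl, Cn in zip(C_layers, C_layers[1:]):      # adjacent layer pairs
        a = Cl - Cl.mean(axis=0)                    # center over the batch
        b = Cn - Cn.mean(axis=0)
        corr = (a * b).sum(axis=0) / (
            np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0) + 1e-8)
        strong = corr > margin                      # the set S of strong dims
        total += corr[strong].sum()                 # penalize only those dims
    return total
```

Two identical adjacent layers incur a penalty of roughly one per dimension, while decorrelated layers contribute almost nothing, which is exactly the pressure toward layer-specialized representations described above.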

Classification & OOD Scoring
Since there is no task-specific final layer (i.e., a classification layer for cross-entropy loss) in LaCL, classification and anomaly detection are conducted via a cosine similarity scoring function (Tack et al., 2020). Employing the cosine similarity scoring function in LaCL is straightforward and shows good compatibility, as a model trained with contrastive learning can measure meaningful cosine similarity between data instances.
For input x, we first extract the implicit ensemble representation z(x) and find the nearest-neighbor instance x_nn, i.e., the training instance maximizing sim(z(x), z(x_nn)). Then we classify the label of x as the label of the nearest neighbor, y_nn. For OOD detection, we use the similarity between the input and its nearest neighbor as the score:

score(x) = sim(z(x), z(x_nn)).

Finally, we decide whether the input x is an outlier through the following binary decision function I_δ:

I_δ(x) = IND if score(x) ≥ δ, otherwise OOD,

where δ denotes the anomaly threshold, usually obtained from the score of the training instance that lies at the boundary of the pre-set true-positive rate.
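The nearest-neighbor classification and scoring step can be sketched as follows (a toy illustration with 2-dimensional representations; the function names are ours):

```python
import numpy as np

def cosine_score(z_query, Z_train):
    """OOD score: cosine similarity to the nearest training instance.

    Returns (score, index of the nearest neighbor) so the neighbor's
    label can also be used for classification.
    """
    Zn = Z_train / np.linalg.norm(Z_train, axis=1, keepdims=True)
    zq = z_query / np.linalg.norm(z_query)
    sims = Zn @ zq
    nn = int(np.argmax(sims))
    return sims[nn], nn

def is_ood(score, delta):
    """Binary decision: a score below the threshold marks an outlier."""
    return score < delta

# Toy usage: an in-distribution-like query scores higher than a distant one.
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
s_in, nn = cosine_score(np.array([0.9, 0.1]), Z)
s_out, _ = cosine_score(np.array([-1.0, -1.0]), Z)
```

In practice δ would be calibrated on training scores (e.g., at the 95% true-positive rate), as described above.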

Augmentation for Contrastive Learning
Augmentation is a crucial factor in CL that directly influences model performance. To find the most effective data augmentation for OOD, we carefully select five data augmentation tactics for contrastive learning: back-translation (BT) (Li et al., 2021), dropout (DO) (Gao et al., 2021), token cutoff (Yan et al., 2021; Shen et al., 2020), random span masking (RSM) (Liu et al., 2021), and token shuffling (Lee et al., 2020). As our final data augmentation tactic, we greedily combined the two best-performing augmentations, i.e., BT and RSM.
Instance 1 (t_1): raw data + RSM + DO
Instance 2 (t_2): BT + RSM + DO
Note that DO is always applied by default unless the dropout probability is manually set to 0, since it utilizes the dropout layers inside the transformer (Vaswani et al., 2017). We explain each augmentation and report their performance in Appendix A.

Implementation Details
In the following experiments, we adopt BERT-base (Devlin et al., 2019) as the backbone of our network. We fixed the dimension of the first layer in GCL to 1024 and the dimension of the second layer to 64 = 768/(num_layers), so that the dimension of the concatenated vector z is 768 (the BERT-base embedding dimension). We used mean pooling as the token-embedding pooling function, set the temperature τ to 0.05, the CR loss weight λ_1 to 1, and the margin m in the CR loss to 0.5. Moreover, we set the batch size to 128 and used the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 1e-5 and a cosine annealing scheduler.

Dataset.
We utilized the CLINC150 (Larson et al., 2019), Banking77 (Casanueva et al., 2020), and Snips (Coucke et al., 2018) datasets for our experiments, which are commonly used in the OOD detection literature. (Appendix B covers statistics, descriptions, and the detailed rationale behind our dataset selection.) Utilizing the selected datasets, we measure OOD performance in 9 different scenarios that can be categorized into the following two settings, both widely used in OOD detection:
• Close-OOD setting (splitting a dataset) refers to a setting where the test (OOD) distribution is close to the train distribution. Usually, the close-OOD setting is simulated by partitioning one dataset into 2 disjoint datasets (i.e., IND / OOD datasets) based on the class labels. Since the IND and OOD datasets originate from the same dataset, they share similar distributions and properties, making the task more demanding. In our experiments, we randomly partitioned the class labels in each dataset with three different ratios (25%, 50%, and 75%), following the validation set-ups in previous works (Shu et al., 2017; Fei and Liu, 2016; Lin and Xu, 2019).
• Far-OOD setting (distinct dataset) refers to a setting where the test (OOD) distribution is far from the IND train distribution, so it is relatively easy to discern test samples from the normal distribution. Usually, the far-OOD setting is simulated by regarding a disjoint dataset as the test (OOD) dataset, e.g., CLINC150 (IND) → Banking77 (OOD) or Snips (OOD). In some scenarios, we verified that some intents belong to both IND and OOD, so we manually removed overlapping intents before training. (Details about the removed intents in each scenario are in Appendix B.2.) We also categorize CLINC150 (IND) → CLINC150 OOD split (OOD) as far-OOD, since previous work (Zhang et al., 2022) manually confirmed that the distribution of the CLINC OOD split is highly unrelated to the CLINC train split.
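The random class-label partitioning used in the close-OOD setting can be sketched as follows (a minimal illustration; the function name and seeding scheme are our own):

```python
import random

def split_classes(labels, ind_ratio=0.5, seed=0):
    """Randomly partition class labels into IND / OOD sets (close-OOD setting).

    labels: iterable of class labels from one dataset.
    ind_ratio: fraction of classes kept as in-distribution (e.g., 0.25/0.5/0.75).
    """
    classes = sorted(set(labels))
    rng = random.Random(seed)
    rng.shuffle(classes)
    k = int(len(classes) * ind_ratio)
    return set(classes[:k]), set(classes[k:])   # (IND classes, OOD classes)

# Toy usage: 4 intent classes split 50/50 into IND and OOD.
ind, ood = split_classes(["a", "b", "c", "d"], ind_ratio=0.5)
```

Sentences whose label falls in the OOD set are then held out entirely from training and used only as test-time outliers.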

Metrics.
To evaluate IND performance, we measured the classification accuracy. For OOD metrics, we adopt two metrics commonly used in the recent OOD detection literature:
• FPR@95. The false-positive rate at the true-positive rate of 95% (FPR@95) measures the probability of classifying an OOD input as an IND input when the true-positive rate is 95%.
• AUROC. The area under the receiver operating characteristic curve (AUROC) is a threshold-free metric that indicates the ability of the model to discriminate outliers from IND samples.
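Both metrics can be computed directly from the IND and OOD score lists. The sketch below assumes higher scores mean "more in-distribution" and uses the pairwise-comparison form of AUROC (function names are ours):

```python
import numpy as np

def fpr_at_tpr(ind_scores, ood_scores, tpr=0.95):
    """FPR when the threshold admits `tpr` of IND inputs."""
    delta = np.quantile(ind_scores, 1 - tpr)      # 95% of IND scores exceed delta
    return float(np.mean(np.asarray(ood_scores) >= delta))  # OOD accepted as IND

def auroc(ind_scores, ood_scores):
    """Probability that a random IND input outscores a random OOD input."""
    ind = np.asarray(ind_scores)[:, None]
    ood = np.asarray(ood_scores)[None, :]
    return float(np.mean((ind > ood) + 0.5 * (ind == ood)))  # ties count half
```

For large score lists one would use a sorting-based AUROC instead of the quadratic pairwise comparison, but the pairwise form makes the metric's meaning explicit.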

Competing Methods
Recent OOD detection methods can be divided into scoring-function methods and model-training methods. We compare LaCL against combinations of the two to investigate its effectiveness from a holistic view.
Scoring functions: • Mahalanobis distance discerns abnormal input via class-wise density estimation, assuming the representations follow multivariate normal distributions (Lee et al., 2018b). It is a multi-dimensional generalization of quantifying how many standard deviations a point lies away from the mean of the distribution.
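A minimal sketch of the class-wise Mahalanobis score, following the common tied-covariance formulation of Lee et al. (2018b) (the shared covariance and the negation, which makes higher scores mean "more in-distribution", are modeling choices we assume here):

```python
import numpy as np

def mahalanobis_score(z, class_means, cov):
    """Negative minimum class-wise Mahalanobis distance (higher = more IND).

    z: (D,) representation; class_means: list of (D,) class centroids;
    cov: (D, D) covariance shared (tied) across classes.
    """
    prec = np.linalg.inv(cov)                       # shared precision matrix
    dists = [(z - mu) @ prec @ (z - mu) for mu in class_means]
    return -min(dists)                              # distance to closest class
```

An input far from every class centroid receives a large distance and hence a very negative score, flagging it as an outlier.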
We also cover the explicit ensemble of the Mahalanobis score (Shen et al., 2021), which is a simple aggregation of the Mahalanobis distances (D) of the intermediate representations:

score_ens(x) = Σ_{l=1}^{|L|} D( tanh(h_l(x)) ).   (8)

Notably, they place a nonlinear tanh layer to map the features of each transformer layer.
• Cosine similarity determines outliers by utilizing the similarity between the inferred input and its nearest neighbor among the known instances (usually from the training dataset). Sec. 3.5 elaborates on the details of the cosine similarity scoring function. We also cover an explicit ensemble version of the cosine scoring function, which determines OOD with an aggregation of the cosine similarities of the intermediate representations, analogous to Eq. 8 but without the tanh function in the last term.
Training methods: We set a cross-entropy trained model and a sigmoid-based 1-vs-rest classifier (Shu et al., 2017) as baseline models. Additionally, we compare our method with 6 recent CL based methods (Gao et al., 2021; Liu et al., 2021; Yan et al., 2021; Li et al., 2021; Zhang et al., 2022; Zhou et al., 2021). For the unsupervised CL methods, we additionally train them with the cross-entropy loss to give a signal about the training distribution, as in the OOD-specific frameworks. We extract the mean-pooled representation of the last-layer features for all methods and pass it to a scoring function. On the other hand, LaCL exploits the implicit ensemble representation z from GCL.

Main Results
This section reports the performance of LaCL and other competing methods in two different settings: the close-OOD split scenarios and the three far-OOD scenarios. (Performance with the remaining ratios, i.e., 25% and 75%, is reported in Appendix C.) We report the average and standard deviation of 5 trials as the model performance for reproducibility.
From the results, we verified that LaCL with the cosine scoring (single) function consistently surpasses the other methods by a significant margin. We also confirmed that most methods (excluding LaCL) exhibit better performance with the explicit ensemble, indicating the potential of intermediate representations in OOD detection, as suggested in past studies (Shen et al., 2021; Sastry and Oore, 2020; Lee et al., 2018b). However, the performance of LaCL degrades with the explicit ensemble evaluations, proving that our ensemble method can gather more distinctive and calibrated information from the entire layers than the naïve aggregation, to which the explicit ensemble only adds noise. It is also worth noticing that LaCL shows better compatibility with the cosine evaluation than with the Mahalanobis evaluation, since the Mahalanobis evaluation assumes that the extracted representations follow a Gaussian distribution. This assumption holds when the model is trained with cross-entropy loss, as such a model can be viewed as a generative classifier (Lee et al., 2018b). However, LaCL does not utilize cross-entropy loss, so the mentioned assumption is hardly met. Lastly, the cosine ensemble evaluation tends to perform better than its Mahalanobis ensemble (Shen et al., 2021) counterpart in general. We conjecture that aggregating each result into a single one is more difficult for the Mahalanobis ensemble, as the Mahalanobis distance is not a normalized score (ranging from −∞ to ∞) while the cosine similarity is normalized (ranging from −1 to 1). To conclude, we demonstrate that our model can extract an elaborate ensemble representation, which yields the highest performance in various scenarios without multiple computations of the scoring function.

Analysis
In this section, we conduct supplementary experiments on LaCL to analyze our framework in depth and elucidate its behavior.

Layer-wise Performance
Although our model outperforms other methods, it is unclear whether LaCL can well assemble the information in the intermediate representations, analogous to our initial intuition. In an attempt to answer this question, we scrutinize the layer-wise performance of LaCL and the baseline model. Fig. 3 summarizes the layer-wise AUROC score of LaCL and the baseline in the far-OOD and close-OOD settings. While higher layers tend to exhibit better performance, this is not always the case. In other words, the last layer does not always guarantee the best performance among the upper layers. In this situation, the explicit ensemble of the baseline model shows a performance gain only conditionally. Namely, in a far-OOD setting (Fig. 3a), the ensemble representation displays a substantial performance gain. In contrast, in a close-OOD setting (Fig. 3b), the ensemble representation often yields worse performance than the best-performing single layer. On the other hand, LaCL displays the best performance among all layers unconditionally, proving the capability of LaCL to properly absorb the layer-specialized information of the entire layers.

Ablation study
We present ablations on LaCL to give intuition behind its behavior and justify our design choices.
Module ablations. We alter our model in several ways by removing some components of LaCL to test their independent impact. Tab. 3 summarizes the component-wise ablations of our model in the Banking 50% split setting, which is the harshest condition (lowest performance) in Tabs. 1 and 2. While our layer-agnostic training (GCL) and regularization term (CR loss) do not statistically contribute to the accuracy compared to applying SCL alone, they substantially improve OOD performance.
LaCL variants. From the previous experiments (Sec. 5.1), we verified that the higher layers tend to yield better performance than the lower layers. So it is a reasonable conjecture that assembling only the upper layers may render better performance, assuming there is no meaningful information in the lower layers. Founded on this observation, we introduce two variants of LaCL. The first variant (variant 1 in Tab. 3) utilizes only the upper half of the layers, z*, at inference:

z* = c_{|L|/2+1} ⊕ ... ⊕ c_{|L|}.

The second variant (variant 2) also utilizes the upper half layers z*; however, it disconnects the lower half of the layers from GCL during training. To our surprise, we verified that LaCL outperforms the other two variants, indicating that the features from the lower layers retain considerable meaningful information, regardless of their individual performance.

Distribution Visualization
In this section, we plot a histogram for our model and the baseline model to visualize how each model forms the IND and OOD distributions. Fig. 4 illustrates the histogram of the cosine scoring function for LaCL and the baseline model trained in the Banking split 50% setting. We regard inputs as OOD when the input score is lower than the threshold δ, where δ is the preset threshold at a TPR of 95%, as stipulated in FPR@95. To our surprise, both models have the ability to discriminate IND-wrong (yellow line) from IND-right answers (blue line), meaning they can output high uncertainty for inputs that are likely to be wrong. On the other hand, LaCL forms a much clearer decision boundary and measures the predictive uncertainty for OOD inputs (green line) more precisely.

Case Study
We also conducted a case study on misclassified OOD inputs to identify the shortcomings and limitations of our model. Tab. 4 summarizes some OOD inputs that LaCL misclassified as normal input, along with their predicted IND class. In most cases, they include keywords or phrases that are highly relevant to the wrongly predicted intent, meaning the model tends to learn some shortcuts (Geirhos et al., 2020)

Limitations
Currently, our model concatenates compressed representations to gather information from the entire layers. Hence, if the number of layers changes with the backbone, the hyper-parameters of LaCL need to be manually re-optimized. Additionally, our methodology is general-purpose and could be applied to tasks other than OOD detection, but its utility in other tasks has not been explored. For future work, we will explore the compatibility of our framework with other tasks or areas (e.g., computer vision) and devise an approach to optimize the aforementioned hyper-parameters in an automated fashion.
Back-Translation is a method of translating a raw sentence into another language and then re-translating it back into the original language. Precisely, we translate the raw sentence into German and re-translate it back into English using 'transformer.wmt19.en-de.single_model' and 'transformer.wmt19.de-en.single_model' from fairseq (Ott et al., 2019). To avoid the back-translated sentence being completely identical to the original, we generated the top-5 sentences and sampled from them after checking for duplicates.
Dropout (Gao et al., 2021) utilizes the dropout layers in transformers (Vaswani et al., 2017) to extract stochastically different representations. Due to the dropout layers, feeding the same input to the same model yields slightly different representations, and dropout utilizes these as an augmentation. Note that dropout is always applied by default.
Random Span Masking (RSM) first randomly selects a span, i.e., k continuous characters, in the input sequence, which is then randomly replaced with [MASK] tokens. In general, RSM is applied to one of the two augmented instances, as proposed in the original MirrorBERT paper (Liu et al., 2021). In this paper, we additionally consider applying it to both sides of a pair.
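A token-level sketch of RSM (for illustration only; the original operates on spans of the input sequence, and our span length, seeding, and [MASK] handling are assumptions):

```python
import random

def random_span_mask(tokens, span_len=3, seed=None):
    """Replace one random contiguous span of tokens with [MASK] tokens."""
    rng = random.Random(seed)
    if len(tokens) <= span_len:
        return ["[MASK]"] * len(tokens)           # whole input masked
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + ["[MASK]"] * span_len + tokens[start + span_len:]

# Toy usage: mask a random span of 2 tokens in a 5-token utterance.
masked = random_span_mask("check my account balance please".split(),
                          span_len=2, seed=0)
```

The masked view keeps its original intent label, so the contrastive objective learns to match it with the unmasked (or back-translated) view of the same sentence.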
Token Shuffling randomly shuffles the order of the input tokens (via their positional embeddings) in the input sequence.
Token Cutoff is a simple strategy that randomly eliminates some input tokens.
We investigate the effectiveness of the aforementioned augmentations in OOD detection to select our final data augmentation combination. Tab. 7 summarizes the results in the Banking split 50% setting. For our final data augmentation, we greedily combined the two best-performing augmentations, i.e., BT and RSM, which showed the best performance on the OOD metrics.

B.1 Dataset Selection and Details
In order to investigate the performance of our model in many different situations, we conduct experiments on intent classification datasets. Generally, intent classification classes are organized hierarchically, often consisting of domains (e.g., banking, travel, reservation) and intents (e.g., banking - transfer money, banking - check account), where one domain serves as a parent category of multiple intents. It is much more demanding to distinguish an unknown intent under the same domain than to discern an unseen domain (Zhang et al., 2022), as a domain is a high-level concept. Considering the facts mentioned above, we carefully selected the CLINC150 (Larson et al., 2019), Banking77 (Casanueva et al., 2020), and Snips (Coucke et al., 2018) datasets, each comprising a distinct class hierarchy.
Specifically, the CLINC150 dataset contains various domains and intents, so it is a favorable dataset for measuring overall model performance. The Banking77 dataset consists of 77 fine-grained intents under a single banking domain. On the other hand, the Snips dataset comprises seven different domains, making each class relatively easy to discern. (See Tab. 5 for statistics on each dataset.)

B.2 Overlapping Intents
In the far-OOD setting, we train the model with the CLINC dataset and test with the Snips or Banking dataset. While each dataset includes a variety of domains, there is a potential overlap between the datasets. We manually compared the domains and intents in each dataset and removed overlapping classes, as there should be no domain overlap between the OOD test set and the train set. Specifically, the banking and credit_cards domains in CLINC150 are similar to Banking77, so we removed these domains from CLINC before training the model. Likewise, Snips also includes some intents that occur in CLINC150, summarized in Tab. 6. We removed the 7 CLINC intents listed in Tab. 6 when utilizing the Snips dataset as an OOD dataset.

C Additional Experiments
C.1 Split setting with remaining ratios. In this experiment, we report the remaining experiments in the split setting (ratios 25% and 75%) with the BERT backbone in the corresponding table. The results resemble the tendency of the split ratio 50% in the main-paper experiments: LaCL outperforms the other models significantly.

D.1 Case study
In this section, we elaborate a detailed case study on LaCL trained and tested in the CLINC150 setting. Following the previous experiment in the paper, we regard inputs as OOD when the input cosine score is lower than the threshold δ, where δ is the preset threshold at a TPR of 95%, as stipulated in FPR@95.
To further investigate our error cases, we categorized them into two classes: OOD inputs misclassified as IND, and IND inputs misclassified as OOD. The former occurs when LaCL predicts a high confidence for an OOD input.
Example cases for this error are shown in Tab. 9. There is considerable variation within this error class; however, the inputs contain keywords or phrases that are highly relevant to the wrongly predicted IND intent, meaning the model tends to learn shortcuts (Geirhos et al., 2020) from the train set, as mentioned in the paper. The fact that a fine-tuned classifier learns shortcuts from the train set is a well-known problem that previous works have addressed (Moon et al., 2021). As a side note, a few data instances were mislabeled, as can be seen in Tab. 9. IND inputs misclassified as OOD occur when LaCL predicts a low confidence for an IND input. Example cases for this error are summarized in Tab. 10. We sort the error cases into 3 groups: misspelled words (typos), nonstandard words (e.g., acronyms, slang), and absence of keywords.
The first error case occurs when words heavily related to the intent are misspelled. Interestingly, even though the model assigns a low score to this type of error, it predicts the true intent correctly. The second error type happens when nonstandard words appear. We believe these errors are caused by the PLM lacking semantics for such abnormal tokens. Lastly, the final error case arises when the intent-specific words are absent from the sentence. Namely, LaCL suffers when the input sentence comprises words that are not commonly used, although their semantics are roughly the same. This phenomenon is another example of the prominent keyword over-reliance. However, learning shortcuts is a natural phenomenon considering the following example: in the train data for the 'text' intent, the word 'text' appears in 96 out of 100 sentences. As a side note, a few data instances were mislabeled, similar to the previous error cases.
In SCL, each batch B = {(x_b, y_b)}_{b=1}^{|B|} in the dataset, where x_b and y_b denote a sentence and a label for index b respectively, generates an augmented batch B̄ = {(x̄_b, ȳ_b)}_{b=1}^{|B̄|}, where the labels of the augmented views are preserved as the original ones. The augmented batch B̄ consists of two augmented inputs, x̄_{2b−1} = t_1(x_b) and x̄_{2b} = t_2(x_b), where t_1 and t_2 indicate the data augmentation functions specified in Sec. 3.6. Then, (x̄_{2b−1}, x̄_{2b}) are passed through the PLM and projector, generating latent vectors (z_{2b−1}, z_{2b}) that are used to calculate the supervised contrastive loss.

Figure 2 :
Figure 2: Overall structure of Layer-agnostic Contrastive Learning (LaCL). The global compression layer trains the SCL loss in a layer-agnostic manner by engaging the entire layers in the CL task, and the correlation regularization (CR) loss decorrelates each intermediate layer to avoid overlapping information between layers.
where d indicates the index of the hidden embedding dimension (d ∈ {1, ..., |D|/|L|}, where c_l ∈ R^{|D|/|L|}) and b refers to a data index of the augmented batch B̄.
Figure 3: Layer-wise AUROC score of the baseline and LaCL with the cosine scoring function. The explicit ensemble (baseline) tends to work well in a relatively easy setting (far-OOD), while it yields worse performance than the best-performing single representation in harsh conditions (close-OOD). The implicit ensemble representation from LaCL outperforms the other layers consistently.

Figure 4 :
Figure 4: Histogram of LaCL and Baseline model trained on Banking split 50% setting.

Table 2 :
IND / OOD performance of each model in 3 far-OOD settings. The best performance within each method is indicated in bold and the global best is underlined.

Table 3 :
Ablation study on LaCL components and its variants in the Banking split setting.

Table 4 :
OOD Input Text                            | Prediction
Where can i find cheap rental skis nearby | car_rental
Search up someone who plays in a movie    | play_music
What oil is best for chicken              | oil_change_how
Read text                                 | text
What is harry's real name                 | change_user_name
Check battery health on this device       | jump_start
Who invented the internet                 | who_made_you
Examples of OOD samples misclassified as IND. The keywords that cause over-reliance are in bold.
instead of capturing the holistic context. Another notable observation is that LaCL is fragile to typos and non-standard language (e.g., acronyms, slang). More thorough explorations are in Appendix D.1.

Table 6 :
Overlapping classes between CLINC and Snips dataset.

Table 7 :
Data augmentation results in the Banking split 50% setting.