Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text

Self-supervised representation learning has proved to be a valuable component for out-of-distribution (OoD) detection when only the texts of in-distribution (ID) examples are available. These approaches either train a language model from scratch or fine-tune a pre-trained language model using ID examples, and then take the perplexity output by the language model as the OoD score. In this paper, we analyze the complementary characteristics of both methods and propose a multi-level knowledge distillation approach that integrates their strengths while mitigating their limitations. Specifically, we use a fine-tuned model as the teacher to teach a randomly initialized student model on the ID examples. Besides the prediction layer distillation, we present a similarity-based intermediate layer distillation method to thoroughly explore the representation space of the teacher model. In this way, the learned student can better represent the ID data manifold while gaining a stronger ability to map OoD examples outside the ID data manifold, thanks to the regularization inherited from pre-training. In addition, the student model sees only ID examples during parameter learning, which further promotes more distinguishable features for OoD detection. We conduct extensive experiments on multiple benchmark datasets, i.e., CLINC150, SST, ROSTD, 20NewsGroups, and AG News, showing that the proposed method yields new state-of-the-art performance. We also explore its application as an AIGC detector to distinguish answers generated by ChatGPT from those written by human experts. We observe that our model exceeds human evaluators in the pair-expert task on the Human ChatGPT Comparison Corpus.


Introduction
Machine learning systems such as dialog agents are widely used in many real-world applications. (Our code is available at https://github.com/microsoft/KC/tree/main/papers/MLKD_OOD.) These systems have proved to work well when the distributions of training data and test data are the same or closely similar. However, when there is a gap between the training distribution and the test distribution, trained models may produce dubious, or even disastrous, predictions that can cause serious AI safety issues (Hendrycks and Gimpel, 2017). Therefore, it is crucial to detect out-of-distribution (OoD) inputs for deployed machine learning systems. Moreover, lifelong learning systems are usually required to discover OoD examples during their application so as to create new tasks and learn them incrementally (Liu and Mazumder, 2021), which further highlights the importance of OoD detection.

In this paper, we focus on the task of OoD detection with only in-distribution texts available during learning, since this setting can handle diverse scenarios such as non-classification applications while requiring the least data collection effort. Recent studies have well demonstrated the validity of self-supervised representation learning for this task (Manolache et al., 2021; Arora et al., 2021; Mai et al., 2022). These approaches use ID examples either to fine-tune a large pre-trained language model (M_finetune) (Manolache et al., 2021; Mai et al., 2022) or to train a language model from scratch (M_fromScratch) (Mai et al., 2022). Given an input sequence at inference time, the token perplexity output by the learned/fine-tuned language model is regarded as the OoD score, i.e., an indication of the example being OoD.

However, both methods have limitations. For M_finetune, since the pre-training corpus usually consists of huge-scale datasets from a diverse range of genres, it is possible that some OoD examples are seen during pre-training, leading to a risk of indistinguishable perplexities between ID examples and these "leaked" OoD examples, as shown in Figure 1. For M_fromScratch, the model trained from scratch lacks the regularization gained from pre-training, which limits its ability to map OoD examples outside the ID data manifold. Inspired by Ma et al. (2022), which indicates that unsupervisedly trained sentence embeddings (mean pooling over all token representations) (Giorgi et al., 2021) can achieve non-trivial performance on sentence classification tasks, we contemplate that the pre-training procedure of language models facilitates their ability to capture semantic relatedness. In other words, language models are encouraged to map examples with different semantics to different manifolds via pre-training. Therefore, we suggest inheriting the representation space with such characteristics gained from pre-training to mitigate the limitation of M_fromScratch.
In this paper, we propose to adopt multi-level knowledge distillation to integrate the strengths of both methods while mitigating their limitations. Specifically, we first produce a teacher model by fine-tuning a large pre-trained language model with ID training examples, so that features of the teacher model can well represent the ID data manifold, while to some extent preserving the ability to map examples with different semantics to different manifolds. Then, we perform knowledge distillation to learn a student model from scratch, using ID training examples with supervision from the fine-tuned teacher model. To learn the teacher's representation space more thoroughly, we not only perform prediction layer distillation, but also propose a similarity-based intermediate layer distillation method to make the student model aware of the information flow inside the teacher's layers. Finally, we deploy the learned student model to compute token perplexity for each inference example as its OoD score and compare it with a threshold to determine whether the example is OoD or not. In contrast to M_finetune, our student model does not see any OoD examples during parameter learning, thus avoiding the leakage of OoD examples. Compared with M_fromScratch, our student model is trained with the regularization inherited from pre-training via the multi-level supervision from the teacher model, thus gaining a stronger ability to map OoD examples outside the ID data manifold. Both are conducive to more distinguishable representations for OoD detection.
Moreover, with the development of automatic text generation technologies such as InstructGPT (Ouyang et al., 2022) and ChatGPT, the societal risks of automatically generated content (e.g., fake news or fake product reviews) are increasing. Therefore, we further adapt our model to distinguish texts generated by AI models from those written by human experts. By conducting experiments on the Human ChatGPT Comparison Corpus (HC3), we observe that our model beats human evaluators and shows excellent capability in the pair-expert task.
Our major contributions can be summarized as follows:
• We analyze the limitations of existing methods for OoD detection with solely ID examples. We investigate their complementary characteristics and propose a novel multi-level knowledge distillation-based approach that unifies the strengths of previous studies while mitigating their limitations. To the best of our knowledge, this is the first attempt to adapt knowledge distillation to textual OoD detection.
• We propose a dynamic intermediate layer distillation method to force the student model to thoroughly explore the representation space of the teacher model. The learned student can well represent the ID data manifold while gaining a stronger ability to map OoD examples outside the ID data manifold.
• We apply our model as an AIGC detector to distinguish automatically generated texts from those generated by human experts. The experimental results show that our model outperforms human evaluators in the pair-expert task on the HC3 benchmark.

Related Work
Considering the accessibility of OoD data and class labels of ID data, previous work for OoD detection can be divided into three categories: i) OoD data available; ii) OoD data unavailable, but class labels of ID examples available; and iii) both types of data unavailable.
Methods with supervision from OoD data. These methods usually train a binary classifier (Larson et al., 2019) or a multi-class classifier (Hendrycks et al., 2019; Zhan et al., 2021) to detect OoD examples, where OoD data is regarded as an independent class for training. OoD data used as supervision is collected from other existing datasets that are disjoint with the ID training data (Hendrycks and Gimpel, 2017). Some previous work also introduces synthesized pseudo outliers to find a more representative classification hyperplane for OoD data (Zhan et al., 2021). However, since there are various reasons for an example to be considered OoD, e.g., being out-of-domain (Daumé III, 2007), infrequent (Sagawa et al., 2020), or adversarial (Carlini and Wagner, 2017; Arora et al., 2021), it is impractical to collect OoD data that covers all such cases for learning.
Methods without supervision from OoD data but with supervision from ID class labels. These approaches generally consider the scenario of multi-class classification such as intent detection (Lin and Xu, 2019; Yilmaz and Toraman, 2020) and assume that class labels of in-distribution (ID) data are available during model training. Class probabilities (Hendrycks and Gimpel, 2017; Shu et al., 2017; Liang et al., 2018; Zeng et al., 2021b; Zhou et al., 2021) and distance or density in latent space (Lin and Xu, 2019; Xu et al., 2020; Podolskiy et al., 2021; Zeng et al., 2021a; Zhou et al., 2021) are the most prevalent metrics. In particular, Hendrycks and Gimpel (2017) propose a strong baseline that takes the maximum softmax probability (MSP) of a multi-class classifier as the OoD score. Based on that, many subsequent studies have been devoted to improving the model's calibration with temperature scaling (Liang et al., 2018), contrastive learning (Zeng et al., 2021b; Zhou et al., 2021), etc. Distance- and density-based approaches first learn discriminative deep features via carefully designed loss functions, e.g., the large margin cosine loss (Lin and Xu, 2019; Xu et al., 2020) and the contrastive loss (Zeng et al., 2021a,b; Zhou et al., 2021), and then compute distance or density metrics such as the local outlier factor (LOF) (Breunig et al., 2000; Lin and Xu, 2019; Zeng et al., 2021b) or Gaussian discriminant analysis (Xu et al., 2020; Podolskiy et al., 2021; Zeng et al., 2021a,b) to detect OoD examples.
Methods without supervision from either OoD data or ID class labels. Given only the in-distribution data, these methods generally estimate the ID density and regard test examples that deviate from the estimated distribution as OoD examples. Previous work for this setting mainly comes from the field of computer vision, where variational autoencoders (VAE) (Kingma and Welling, 2014) and generative adversarial networks (GAN) are frequently taken as the backbone models for density estimation (Chen et al., 2018; Zenati et al., 2018). In natural language processing, current studies generally perform self-supervised language modeling on ID examples and take token perplexity as the OoD score (Arora et al., 2021; Manolache et al., 2021; Mai et al., 2022). Gangal et al. (2020) further introduce an independent background model to correct confounding background statistics. Moreover, Xu et al. (2021) learn a combination of latent representations from different layers of pre-trained transformers to represent the ID data manifold in a compact way, on top of which one-class classification methods such as the one-class SVM (Schölkopf et al., 2001) are applied to detect OoD examples. In contrast to these studies, our approach unifies the strengths of existing self-supervised methods while mitigating their limitations. To the best of our knowledge, this is the first attempt to adapt knowledge distillation to textual OoD detection. Compared with Xu et al. (2021), our approach does not involve any hyper-parameter-sensitive one-class classification stage. Compared with Jin et al. (2022), our proposed method requires no prior knowledge about the ID data, e.g., the number of semantic categories. Moreover, our approach is orthogonal to Gangal et al. (2020), and the two can be combined to achieve better performance.

Methodology
In this section, we elaborate on the proposed multi-level knowledge distillation approach for OoD detection. First, we clarify how to produce a teacher model that estimates the distribution of ID data. Then, we describe the proposed multi-level knowledge distillation procedure that teaches a randomly initialized student model with the produced teacher model. Note that in this paper, both the teacher and student networks are built with Transformer layers (Vaswani et al., 2017). Figure 2 illustrates the overall framework.

Teacher Model
Here, we use language models as the base model to estimate the distribution of ID data. Causal language modeling (CLM) and masked language modeling (MLM) are the two most representative language modeling objectives. CLM predicts the next token based on the unidirectional context, while MLM first masks some tokens and then predicts the masked tokens conditioned on the bidirectional context. Since MLM requires multiple forward passes to predict the probability of every token in a sentence, it is time-consuming to exploit MLM for estimating the ID data distribution. Therefore, we utilize CLM in our approach.
Given a text sequence $x = \{x_i\}_{i=1}^{N}$, where $x_i$ is the $i$-th token and $N$ is the sequence length, the probability estimation function of CLM can be formulated as:

$$p(x) = \prod_{i=1}^{N} p(x_i \mid x_{<i}), \qquad (1)$$

where $x_{<i}$ denotes the tokens before the $i$-th token $x_i$. In this paper, we fine-tune a large pre-trained language model on the ID training examples to produce the teacher model. The loss function w.r.t. $x$ is:

$$\mathcal{L}_x^{\mathrm{tea}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}; \theta_{\mathrm{tea}}), \qquad (2)$$

where $\theta_{\mathrm{tea}}$ represents the parameters of the teacher model.
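As a concrete illustration of Eq. (2), the following is a minimal sketch (not the authors' released code) of fine-tuning a pre-trained causal language model on ID text with HuggingFace Transformers; the GPT-2 checkpoint and the variable id_texts are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative setup: GPT-2 small as the pre-trained teacher.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
teacher = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(teacher.parameters(), lr=5e-5)

def clm_loss(model, text):
    # Eq. (2): average negative log-likelihood of each token given its left context.
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss  # HF shifts labels internally for CLM

teacher.train()
for text in id_texts:  # id_texts: list of ID training sentences (assumed to exist)
    loss = clm_loss(teacher, text)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```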

Knowledge Distillation
With the supervision from this teacher model, we then train a student model by performing both prediction layer distillation and intermediate layer distillation.

Prediction Layer Distillation
Given a training ID sequence $x = \{x_i\}_{i=1}^{N} \in \mathcal{D}_{\mathrm{in}}$, the learning loss for prediction layer distillation w.r.t. $x$ is formulated as the Kullback-Leibler divergence between the probability distributions over the vocabulary $V$ output by the teacher model and by the student model. Averaging over all tokens, we have:

$$\mathcal{L}_x^{\mathrm{pred}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\big( p(x_i \mid x_{<i}; \theta_{\mathrm{tea}}) \,\|\, p(x_i \mid x_{<i}; \theta_{\mathrm{stu}}) \big), \qquad (3)$$

where $x_i$ represents the $i$-th token in $x$, $p(x_i \mid x_{<i}; \theta_{\mathrm{tea}})$ denotes the probability distribution for the $i$-th token output by the teacher model, and $p(x_i \mid x_{<i}; \theta_{\mathrm{stu}})$ denotes that of the student model.
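A minimal PyTorch sketch of this token-level KL objective, assuming teacher and student are causal LMs that share one vocabulary and ids is an already tokenized ID sequence (names are illustrative, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def prediction_distillation_loss(teacher, student, ids):
    # Eq. (3): KL(teacher || student) on each next-token distribution, averaged over tokens.
    with torch.no_grad():
        t_logprob = F.log_softmax(teacher(ids).logits, dim=-1)              # [1, N, |V|]
    s_logprob = F.log_softmax(student(ids).logits, dim=-1)                  # [1, N, |V|]
    per_token_kl = (t_logprob.exp() * (t_logprob - s_logprob)).sum(dim=-1)  # [1, N]
    return per_token_kl.mean()
```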

Intermediate Layer Distillation
Considering that different layers in large pre-trained language models generally correspond to features at various abstraction levels (Jawahar et al., 2019; Caucheteux et al., 2021), we propose an intermediate layer distillation method to help the student model acquire a more comprehensive awareness of the information flow inside the teacher's layers. Instead of pre-defining a fixed mapping function between teacher layers and student layers, we dynamically match each hidden vector of the student to multiple hidden vectors from different layers of the teacher. Specifically, we first use the ℓ2 distance to measure the similarity between the hidden vector produced by the student model w.r.t. the $i$-th token at the $l$-th layer (i.e., $h^{\mathrm{stu}}_{l,i}$) and that produced by the teacher model w.r.t. the $i$-th token at the $j$-th layer (i.e., $h^{\mathrm{tea}}_{j,i}$):

$$s_{l,i}(j) = -\big\| h^{\mathrm{stu}}_{l,i} - W_j\, h^{\mathrm{tea}}_{j,i} \big\|_2, \qquad (4)$$

where $j \in A$, $A$ represents the set of the teacher's layer indexes, and $W_j$ are learnable parameters. Let $S^K_{l,i} = \{s^k_{l,i}\}_{k=1}^{K}$ denote the top-$K$ similarities computed by Eq. (4) w.r.t. $h^{\mathrm{stu}}_{l,i}$. We then train the student model by maximizing the similarities in $S^K_{l,i}$. Let $\beta_k$ denote the to-be-learned weighting scalar corresponding to the $k$-th similarity in $S^K_{l,i}$. The learning loss at the $l$-th layer w.r.t. $x$ can be formulated as:

$$\mathcal{L}_x^{(l)} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \beta_k\, s^k_{l,i}. \qquad (5)$$

Finally, we integrate the prediction layer distillation and the intermediate layer distillation. Let $T$ denote the set of the student's layer indexes; the whole training loss of the student model is the summation of losses w.r.t. all sentences in $\mathcal{D}_{\mathrm{in}}$:

$$\mathcal{L} = \sum_{x \in \mathcal{D}_{\mathrm{in}}} \Big( \mathcal{L}_x^{\mathrm{pred}} + \lambda \sum_{l \in T} \mathcal{L}_x^{(l)} \Big), \qquad (6)$$

where $\lambda$ is a hyper-parameter for weighting.
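The sketch below illustrates one plausible PyTorch instantiation of Eqs. (4)-(5); since the exact placement of the projection $W_j$ and the parameterization of the weights $\beta_k$ are design details, the choices here (projecting the teacher hidden states, softmax-normalizing learnable $\beta_k$) are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class IntermediateDistillLoss(nn.Module):
    """Sketch of similarity-based intermediate layer distillation for one student layer.
    Assumptions: similarity = negative L2 distance after a per-teacher-layer projection W_j;
    the top-K weights beta_k are learnable scalars normalized with a softmax."""

    def __init__(self, hidden_size, n_teacher_layers, top_k=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(hidden_size, hidden_size, bias=False)
                                   for _ in range(n_teacher_layers)])      # W_j
        self.beta = nn.Parameter(torch.zeros(top_k))                       # beta_k
        self.top_k = top_k

    def forward(self, h_stu, teacher_hiddens):
        # h_stu: [N, H] hidden states of one student layer.
        # teacher_hiddens: list over teacher layers of [N, H] tensors.
        sims = torch.stack([-(h_stu - self.proj[j](h_tea)).norm(dim=-1)    # Eq. (4), [N]
                            for j, h_tea in enumerate(teacher_hiddens)])   # [L_tea, N]
        top_sims, _ = sims.topk(self.top_k, dim=0)                         # K closest layers per token
        weights = torch.softmax(self.beta, dim=0).unsqueeze(-1)            # [K, 1]
        # Eq. (5): maximize the weighted top-K similarities, i.e., minimize their negation.
        return -(weights * top_sims).sum(dim=0).mean()
```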

Inference
For inference, we only use the learned student model $\theta_{\mathrm{stu}}$ to score an input sequence $x = \{x_i\}_{i=1}^{N}$. We calculate the OoD score w.r.t. $x$ as the perplexity under the student model, i.e., the exponentiated average negative log-likelihood over all tokens:

$$\mathrm{score}(x) = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}; \theta_{\mathrm{stu}}) \Big). \qquad (7)$$

We define a threshold $\gamma$ to classify OoD examples against ID examples: $x$ is predicted as an OoD example if $\mathrm{score}(x) > \gamma$; otherwise it is predicted as an ID example.

Experimental Setup

We conduct experiments on five benchmark datasets: CLINC150, SST, ROSTD (Gangal et al., 2020), 20NewsGroups (Lang, 1995), and AG News (Zhang et al., 2015). For dataset statistics and other detailed information, please refer to Appendix A.1.

We compare each student layer to a combination of $K$ teacher layers, as Haidar et al. (2022) show that distilling the concatenated representation of $K$ sorted, randomly selected intermediate layers is superior to layer-wise distillation. We choose $K = 2$ as the cardinality of the similarity set $S^K_{l,i}$, considering that there is no information fusion among different teacher layers if $K = 1$, and that a larger $K$ may introduce too much noise due to the weighted combination over teacher layers in Equation (5). We re-implement Xu et al. (2021) and Manolache et al. (2021) with BERT-base using their open-sourced code, and report results on all benchmark datasets for a more comprehensive comparison. All experiments are conducted on one Tesla V100 (16GB) GPU. The trainable parameters (i.e., $\theta_{\mathrm{tea}}$ and $\theta_{\mathrm{stu}}$) total 248M, and the training time is about 30 minutes for each model.
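Before turning to the results, the following minimal sketch recaps how the trained student is used for detection (Eq. 7); student, tokenizer, and the threshold value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ood_score(student, tokenizer, text):
    # Eq. (7): perplexity of the input under the student model, i.e., the exponentiated
    # average negative log-likelihood over tokens; higher means more likely OoD.
    ids = tokenizer(text, return_tensors="pt").input_ids
    return torch.exp(student(ids, labels=ids).loss).item()

def is_ood(student, tokenizer, text, gamma=20.0):
    # gamma is an illustrative threshold; in practice it would be chosen on held-out data.
    return ood_score(student, tokenizer, text) > gamma
```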

Main Results
Tables 1 and 2 report the results of our approach alongside those reported by previous state-of-the-art methods on CLINC150, SST, ROSTD, and 20NewsGroups. Our proposed method outperforms the prior methods by a large margin in most experiments, achieving an improvement of up to 9.13, 20.73, and 38.71 points in terms of AUROC, AUPR, and FAR95, respectively, on CLINC150. This well demonstrates the effectiveness of the proposed approach.
The results also show that M_fromScratch generally leads to better performance than M_finetune. We conjecture that seeing no OoD examples during parameter learning helps the randomly initialized model avoid optimizing toward the OoD distribution. Without the bias and constraints inherited from the pre-training process, the model trained from scratch is more likely to find a local minimum that better fits the ID training text and thus yields more distinguishable features for OoD detection. Moreover, our approach, which uses the fine-tuned model to teach a randomly initialized model, integrates their strengths via the proposed multi-level knowledge distillation process, resulting in superior performance.

Ablation Study
To validate the contributions of different components in the proposed approach, we introduce two variants of our model for the ablation study: i) Ours w/ GPT2_Init_θ_stu, which initializes the student model with the pre-trained GPT-2 model; ii) Ours w/o $\mathcal{L}_x^{(l)}$, which eliminates the loss w.r.t. intermediate layer distillation and learns the student model with prediction layer distillation only. Table 3 shows the results.
Comparing Ours with Ours w/ GPT2_Init_θ_stu, we can see that replacing the randomly initialized student model with a pre-trained student model causes a significant performance drop, verifying our motivation to incorporate M_fromScratch with M_finetune. Table 3 also illustrates that after removing the constraints on intermediate layers, i.e., $\mathcal{L}_x^{(l)}$, the student model's performance decreases by 0.90, 0.87, and 5.03 points in terms of AUROC, AUPR, and FAR95, respectively. This validates both the effectiveness and the necessity of intermediate layer distillation. Moreover, even without intermediate layer distillation, the student model in Ours w/o $\mathcal{L}_x^{(l)}$, which is derived with only the prediction layer distillation, still outperforms the baseline model M_fromScratch. We attribute this superiority to the more informative supervision, i.e., the probability distribution produced by the teacher model, compared with the ground-truth one-hot supervision used in M_fromScratch.

Analysis on Distribution of Sentence Repr.
To provide insight into how multi-level knowledge distillation promotes OoD detection, we utilize t-SNE (Van der Maaten and Hinton, 2008) to reduce the dimension of sentence representations obtained from the pre-trained GPT-2, M_finetune, M_fromScratch, and the student model in our approach. Here, we produce sentence representations by averaging token representations. The visualization is shown in Figure 3.
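A minimal sketch of how such a visualization can be produced, assuming student and tokenizer as before and illustrative lists id_texts and ood_texts (this mirrors the mean-pooling described above, not the authors' plotting code):

```python
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def sentence_embeddings(model, tokenizer, texts):
    # Mean-pool the last-layer token representations to obtain one vector per sentence.
    embs = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[-1]  # [1, N, H]
        embs.append(hidden.mean(dim=1).squeeze(0))
    return torch.stack(embs).numpy()

# Project ID and OoD sentence embeddings to 2-D for a visualization like Figure 3.
points_2d = TSNE(n_components=2).fit_transform(
    sentence_embeddings(student, tokenizer, id_texts + ood_texts))
```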

Application
ChatGPT, an optimized language model for dialog, has attracted great attention in the NLP field since its inception. It is capable of providing fluent and comprehensive responses to a large variety of questions. To study how close ChatGPT is to human experts, Guo et al. (2023) propose the Human ChatGPT Comparison Corpus (HC3), where each question is paired with two answers: one is a human answer collected from wiki sources and public QA datasets, and the other is generated by ChatGPT. Based on a human evaluation, Guo et al. (2023) indicate that it can be difficult to distinguish texts generated by ChatGPT from those provided by human experts, and they further propose a RoBERTa-based (Liu et al., 2019) detector to distinguish the two.
Following Guo et al. (2023), in this section we adapt our model as an AI-generated content (AIGC) detector to explore its capability for preventing the potential risks of AIGC abuse. Since our model uses perplexity as the OoD score and Guo et al. (2023) reveal that ChatGPT-generated answers usually have low perplexities, we take the ChatGPT-generated answers as in-distribution data to train our model. We divide the in-distribution data into a training set and a test set, and use all the human-generated answers as the OoD test set.
We first evaluate our model under the same protocol as in our main experiments; Table 4 shows the results. Our approach significantly outperforms the prior state-of-the-art methods DATE and MDF+IMLM under the same settings. Surprisingly, our unsupervised method demonstrates performance comparable to the RoBERTa-single detector, a RoBERTa-based sentence classifier trained with supervision from all the ChatGPT-generated and human-generated texts. We also compare our model to the human evaluation results listed in Guo et al. (2023). Given two answers to the same question, one generated by ChatGPT and the other written by a human expert, our model is required to determine which answer is generated by ChatGPT. Table 5 shows that our model beats human evaluators and handles this task perfectly.
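The pair-expert decision rule reduces to comparing two perplexities; a self-contained sketch under the same assumptions as the inference snippet above (the student model is trained on ChatGPT answers, so the lower-perplexity answer is predicted to be ChatGPT's):

```python
import torch

@torch.no_grad()
def pair_expert_predict(student, tokenizer, answer_a, answer_b):
    # The student LM is trained on ChatGPT-generated answers only, so the answer with
    # the lower perplexity under the student is predicted to be the ChatGPT answer.
    def ppl(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        return torch.exp(student(ids, labels=ids).loss).item()
    return "answer_a" if ppl(answer_a) < ppl(answer_b) else "answer_b"
```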

Conclusion
In this paper, we focus on the setting of OoD detection without supervision from either OoD data or ID class labels. We analyze the complementary characteristics of existing self-supervised representation learning-based methods and propose a multi-level knowledge distillation approach to integrate their strengths while mitigating their limitations. We evaluate the proposed method on multiple datasets, and the results show that it yields new state-of-the-art performance. We analyze why our approach attains superior performance by conducting ablation studies and sentence representation visualization. We further apply our model as an AIGC detector to distinguish ChatGPT-generated texts from those written by human experts, and the experimental results demonstrate that our model outperforms human evaluators in the setting of paired answers.

Appendix

Table 6 shows the results of our model and other methods on the AGNews benchmark. Interestingly, we notice that our approach reports slightly inferior performance compared with MDF+IMLM (Xu et al., 2021). Methods using sentence representations based on token aggregation, e.g., fastText or GloVe (Pennington et al., 2014) based IsoForest, OCSVM, and CVDD (Ruff et al., 2019), as well as the BERT-based MDF+IMLM (Xu et al., 2021), perform especially well on AGNews compared to their performance on other datasets. We conjecture that this is because AGNews has a much larger variation of sequence length (36.6) than the other datasets (around 7 or 8). A larger length variation leads to more acute fluctuations in perplexities, especially when adopting an autoregressive language model with unidirectional context such as the GPT-2-small used in this paper, making it more difficult to distinguish between ID and OoD examples than on other datasets. In contrast, sentence representation based methods benefit from directly estimating the OoD score using information from the whole sentence, thus producing superior performance. Fortunately, this limitation of autoregressive modeling could be alleviated by leveraging Transcormer (Song et al., 2022) as the base model of our approach, where bidirectional context is used for estimating the token at each position. We leave this for future work.

A.2 MDF+IMLM with Different Base Models
We take MDF+IMLM from Xu et al. (2021) as one of the baselines. In the main body of this paper, we report the results of MDF+IMLM with BERT as the base model, because BERT is the most natural counterpart for the GPT-2-small used in our approach. Here we include the RoBERTa-based results of MDF+IMLM from Xu et al. (2021) for reference. Table 8 shows that using a more powerful base model does bring a significant performance gain to MDF+IMLM. Though our model is implemented with GPT-2-small, it still demonstrates comparable (on SST) and even superior (on CLINC150) performance relative to the RoBERTa-based MDF+IMLM.

Table 8: Performance comparison on CLINC150 and SST. † represents results reported in Xu et al. (2021). ‡ denotes our re-implemented results.

A.3 Discussion on CLM and MLM.
Here we discuss the consideration for using CLM rather than MLM. In fact, we conducted experiments using the previous method of masking X% of tokens for a single forward pass. However, the results were not satisfactory. We attribute this to an insufficient perplexity estimate from a single forward pass. In other words, with MLM, it would be better to recover the joint probability of the entire input sequence, i.e., to forward an input sentence multiple times so that the probability of each token in the sentence can be predicted. This is time-consuming, and thus we use CLM in this paper.
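To make the cost argument concrete, the sketch below computes an MLM-based pseudo score by masking one position at a time, which requires N forward passes for an N-token sentence, whereas the CLM score used in this paper needs a single pass; the model and tokenizer names are illustrative.

```python
import torch

@torch.no_grad()
def mlm_pseudo_nll(mlm, tokenizer, text):
    # Mask each position in turn and score the original token there:
    # N forward passes for N tokens (special tokens are not skipped, for brevity),
    # versus one forward pass for the CLM-based score used in this paper.
    ids = tokenizer(text, return_tensors="pt").input_ids
    total, n = 0.0, ids.size(1)
    for i in range(n):
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        logits = mlm(masked).logits[0, i]
        total += -torch.log_softmax(logits, dim=-1)[ids[0, i]].item()
    return total / n
```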