ODIST: Open World Classification via Distributionally Shifted Instances

In this work, we address the open-world classification problem with a method called ODIST (Open World Classification via Distributionally Shifted Instances). This novel and straightforward method creates out-of-domain instances from the in-domain training examples with the help of a pre-trained language model. Experimental results show that ODIST performs better than the state-of-the-art decision-boundary-finding method.


Introduction
In the supervised learning setting, it is generally assumed that test data points belong to the same classes observed during training. This assumption, however, proves unreliable in many applications, especially in dynamic and open environments. For instance, Zhang et al. (2021) show that an intent classifier performs rather poorly in a dialogue system when the user expresses intents unobserved in the training dialogues. In an open environment, the ideal classifier should classify incoming data to the correct existing classes that appeared in training and detect the examples that do not belong to any existing class. Such a classifier is described as performing open set recognition (Scheirer et al., 2013) or open world classification.
The existing research to achieve this capability in natural language processing (NLP) and computer vision (CV) mainly focuses on decision boundary finding (Schölkopf et al., 2001; Tax and Duin, 2004). Recent research shows that it is also possible to use deep neural networks to capture advanced features from the data (Lin and …). In CV, Bendale and Boult (2016) train a multi-class classifier and take the outputs of the penultimate layer to fit a Weibull distribution. Hendrycks and Gimpel (2017) reject low-confidence samples with a threshold based on the probability of the softmax distribution. Liang et al. (2018) add temperature scaling to the softmax function to obtain a calibrated softmax score. In NLP, Shu et al. (2017) adopt the sigmoid function to learn one-vs-all classifiers and calculate the confidence threshold by fitting training data to Gaussian statistics. Zhang et al. (2021) propose to learn the adaptive decision boundary (ADB). ADB performs best among all the above methods on open text classification.
Besides adjusting the decision boundary on the feature space learned from in-domain data, a good feature space representing both in-domain and novel out-of-domain (OOD) examples is also essential for novelty detection, namely: open representation learning. We can illustrate this approach with the following NLP example. Assume that we have only learned features for "it is red" (for cherries) and "it is yellow" (for bananas) in a fruit classification task. The problem manifests when the model is exposed to a blueberry during testing: since it has not seen the class during training, it has no proper way to extract features for "blue". Ideally, we want a representation learning approach that can compute such a representation instead of reusing the representation of "red" or "yellow". The straightforward solution is to expose the model to some examples with "blue" during training, even though blueberry does not exist as an in-domain class. However, in real-world applications, we cannot foresee the OOD examples that will come in the future. Similarly, in CV, recent work (Tack et al., 2020) augments distributionally shifted images by rotating/flipping the original image and pretrains an image representation space for novelty detection. Inspired by this work, we propose a novel and simple distributionally shifted data creation method for NLP. We then train a classifier on in-domain training examples and distributionally shifted examples. Such a classifier can work with existing decision boundary finding methods for further open space risk reduction.
Related Works: Besides the works in open-world learning (Shu et al., 2018), our work is also related to data augmentation. In CV, Chen et al. (2020) propose a simple image pretraining method based on data augmentation. In NLP, Wu et al. (2020) and Lewis et al. (2020a) pretrain language models by contrastive learning on augmented data: Wu et al. (2020) propose word/span deletion, word reordering, and word replacement, while Lewis et al. (2020a) use paraphrasing to augment examples. Different from what we explore in this paper, these works focus on similar instances instead of OOD examples.
In this work, we take advantage of the recent success of pretrained language models. We use the sequence-to-sequence language model BART (Lewis et al., 2020b) for distributionally shifted example creation; BART can fill masked sentences by generation. Furthermore, we use BART fine-tuned on MNLI (Williams et al., 2018) to predict the relationship between the original text and the augmented examples for filtering.

Methodology
Problem Definition: We define a training data set as D = {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)} composed of n examples, where the i-th document x_i is associated with one of the m seen classes, y_i ∈ {l_1, l_2, …, l_m}. In the canonical open-world classification setting, a model learns from the training data and either classifies a test instance to one of the m seen classes or rejects it as unseen (denoted by l_0), i.e., not belonging to any of the seen classes. Therefore, it is an (m + 1)-class classifier.
In our setting, we create distributionally shifted examples D_A = {(x_1^A, l_{m+1}), …} by augmenting the training set D of seen classes into a new augmented class l_{m+1}. We learn a model f(x) using both the in-domain training examples in D and the OOD examples in D_A. During prediction, a data point is classified either to one of the m seen classes from D, or to l_0, either because it is classified as l_{m+1} (from D_A) or because all m seen classes reject it. Therefore, our method is an (m + 2)-class classifier f(x) with the classes C = {l_1, l_2, …, l_m, l_0, l_{m+1}}.
This section introduces the creation process of distributionally shifted instances, the model training, and the testing procedure.

Distributionally Shifted Data Augmentation
As previously discussed, in most real-world scenarios we do not have OOD data or unseen classes' examples available at training time. The goal of distributionally shifted data augmentation is to create OOD examples from the seen classes' examples, so that the model can learn discriminative features for both OOD detection and in-domain classification. Distributionally shifted data augmentation inherits from span replacement (Wu et al., 2020). As shown in Figure 1, there are four steps: 1) chunk each example x in the in-domain training data into pieces; 2) mask each piece iteratively to create masked sentences; 3) replace the <mask> tokens with tokens predicted by the pre-trained generative language model BART to obtain augmented examples; 4) select as qualified OOD examples those augmented examples for which BART fine-tuned on MNLI predicts a contradiction relation between the original and augmented pair. The outcome of this approach is a list of qualified OOD examples {x_i^A, …}, shown as the pink examples in Figure 1. Our motivation for choosing span replacement over the standard data augmentation methods (deletion, reordering, paraphrasing, and word replacement) is the OOD rate among the augmented examples. Reordering and paraphrasing contribute in-domain examples, and word deletion and replacement have lower OOD rates than span replacement. In the example shown in Figure 1, even with span replacement only 1/3 of the augmented examples appear out-of-domain. This suggests that most tokens in an example do not decide its semantics or class label.
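Steps 1 and 2 above can be sketched as follows. This is a minimal illustration under simplifying assumptions: chunking is approximated with fixed-length word spans rather than the NLTK chunking we actually use, and the BART fill-in (step 3) and MNLI filtering (step 4) are left out:

```python
def make_masked_sentences(sentence, span_len=2):
    """Steps 1-2: chunk a sentence into word spans and mask each span in
    turn, producing one masked sentence per span. Fixed-length spans are
    a simplification of the real chunking step."""
    words = sentence.split()
    masked = []
    for i in range(0, len(words), span_len):
        # replace the i-th span with a single <mask> token
        masked.append(" ".join(words[:i] + ["<mask>"] + words[i + span_len:]))
    return masked
```

Each masked sentence would then be passed to BART to fill in the blank, yielding one candidate augmented example per predicted span.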

After preparing the OOD examples D_A, we use them together with the in-domain training examples D for supervised (m + 1)-class classification. The class label space is Y = {l_1, …, l_m, l_{m+1}}. Let f_E denote the encoder network, Linear(·) a linear mapping from a representation r to (m + 1)-dimensional logits, and Softmax(·) the softmax function. The predicted class distribution is

p(y | x) = Softmax(Linear(f_E(x)))  (1)
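As a toy illustration of this (m + 1)-way classification head, the encoder output is replaced here by a fixed feature vector and the linear layer by an explicit, hypothetical weight matrix (in our model the encoder is BERT-based):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(r, weights, biases):
    """Eq. (1) with an explicit linear head: Softmax(Linear(r)).
    `r` stands in for the encoder output f_E(x); one (weight, bias)
    row per class, i.e. m seen classes plus the augmented class."""
    logits = [sum(w_j * r_j for w_j, r_j in zip(w, r)) + b
              for w, b in zip(weights, biases)]
    return softmax(logits)
```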

Rejection
Here we present the method for identifying unseen examples during testing/inference. Given the class prediction ỹ for the example x from the (m + 1)-class classifier described in Section 2.2, the method applies a decision boundary learning method on top of the trained multiclass classifier to further reduce the open space risk. We use the SOTA adaptive decision boundary (ADB) (Zhang et al., 2021) as the boundary finding method. ADB aims to learn Euclidean-distance decision boundaries for every seen class. After training a multiclass classifier, ADB feeds the training examples x_i back into the model and obtains their representations r_i. Based on the representations and class labels {(r_1, y_1), …, (r_n, y_n)}, it calculates the centroid of each seen class {c_1, …, c_m}, and then learns the radii of the boundaries {b_1, …, b_m} by tightening each class's representations toward its class centroid.
Since we use both in-domain training examples and distributionally shifted instances as input to the model, we obtain (m + 1) centroids {c_1, …, c_m, c_{m+1}} and learn (m + 1) boundaries b = {b_1, …, b_m, b_{m+1}} for the (m + 1) classes, i.e., the m seen classes and the augmented class. A testing example is rejected if it lies outside all decision boundaries or belongs to the augmented class.
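A sketch of this rejection rule, assuming nearest-centroid assignment with per-class radii (a simplified reading of ADB inference, not the reference implementation):

```python
import math

def predict_with_rejection(r, centroids, radii, aug_index):
    """Assign representation r to its nearest centroid; reject (label l_0)
    if r falls outside that class's learned boundary radius, or if the
    nearest class is the augmented class l_{m+1}."""
    dists = [math.dist(r, c) for c in centroids]
    k = min(range(len(dists)), key=dists.__getitem__)
    if dists[k] > radii[k] or k == aug_index:
        return "reject"  # l_0
    return k
```

For example, a point well inside a seen class's boundary is accepted, while a point far from every centroid, or closest to the augmented-class centroid, is rejected.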

Experiments
We evaluate our method on three datasets, including Banking (Casanueva et al., 2020); dataset statistics are shown in Table 4. Following Shu et al. (2017) and Zhang et al. (2021), we experiment with three portions of 25%, 50%, and 75% of all the classes as seen classes. For distributionally shifted instance creation, we use NLTK to chunk the sentences and set BART to predict the top-3 candidates with a beam size of 5. Regarding the model architecture and training, we keep the ADB setting that utilizes BERT (Devlin et al., 2019) as the base for multi-class classification. Experiments run on an NVIDIA Tesla V100 GPU. In representation learning, we use all qualified distributional-shift instances associated with the seen classes and maintain class balance in a batch. The training batch size is 128, and the learning rate is 2e-5. For boundary learning, the learning rate is 0.05. We report the averaged scores and standard deviation over five random seeds.
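The class-balance constraint within a batch could be implemented along the following lines (a hypothetical helper for illustration, not our actual sampler):

```python
import random

def balanced_batch(examples_by_class, batch_size, rng=None):
    """Draw one training batch with an equal number of examples per class
    (seen classes plus the augmented class), sketching the class-balance
    constraint described above."""
    rng = rng or random.Random(0)
    per_class = batch_size // len(examples_by_class)
    batch = []
    for label, examples in examples_by_class.items():
        # sample with replacement so small classes can still fill their share
        batch.extend((label, rng.choice(examples)) for _ in range(per_class))
    rng.shuffle(batch)
    return batch
```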
ODIST is our proposed solution that includes the distributionally shifted instances described in Section 2.1. ODIST-DB does not use any decision boundaries and treats the samples predicted into the augmented class as rejected, as shown in Eq. 3.

Table 1: Results of ODIST (including standard deviation) and ADB. ADB's scores are from the original paper (Zhang et al., 2021).
We compare our method to ADB, which trains an m-class classifier on the in-domain training examples and learns decision boundaries for the m seen classes; it is the SOTA method in open text classification. We report the F1 score for the unseen class l_0, the averaged F1 score for the seen classes, and the accuracy over all test data. For the ablation study between ODIST and ODIST-DB, we report the unseen class's precision, recall, and F1 score. We also compare augmentation methods without decision boundaries, which instead use Eq. 3 to directly treat the examples predicted into the class l_{m+1} as rejected; their precision, recall, and F1 score on the unseen class in the OOS 25% setting are reported. The compared methods are: Word Delete 50%, which randomly deletes 50% of the words in the original sentence; Word Reorder 50%, which reorders 50% of the words in the original sentence; ODIST-DB-Select, which is the span replacement proposed in Section 2.1 without the last selection step; and ODIST-DB, which is our proposed data augmentation method.
Result Analysis: As shown in Table 1, ODIST performs better than ADB in all scenarios. This supports that distributionally shifted instances can help open-world classification. It is promising to see the performance improvement on both unseen and seen classes' examples, which suggests that distributionally shifted instances help the model learn features for both in-domain classification and novelty detection. We notice that the performance improvement decreases as the seen ratio increases, because there are more training examples for feature learning when the seen ratio is high. Distributionally shifted instances are thus more helpful in low seen-ratio scenarios.
In Table 2, ODIST is compared to ODIST-DB. The recall scores of ODIST-DB are low, which suggests that the diversity of the distributionally shifted instances is limited and they cannot cover all OOD test samples; this is due to the limited masked portion and the behavior of BART. On the other hand, the precision scores are high, which shows the OOD quality of the distributionally shifted instances. With the decision boundaries, ODIST achieves a high recall with only a slight drop in precision. There is a decreasing trend in precision as the seen ratio increases. This is because our current filter mechanism only compares the augmented example to the original one; with more seen classes, the augmented examples are more likely to be similar to other classes.
We compare ODIST-DB to three other data augmentation methods: ODIST-DB-Select, Word Delete 50%, and Word Reorder 50%. In Table 3, ODIST-DB and ODIST-DB-Select have much higher recall scores on the unseen class than Word Delete 50% and Word Reorder 50%. This suggests that Word Delete 50% and Word Reorder 50% cannot produce distributional-shift points and enrich discriminative features, whereas the span-replacement-based methods (ODIST-DB and ODIST-DB-Select) inject new text spans that help open representation learning. All methods perform well on the precision of the unseen class, even though some mix in-distribution and OOD data in their augmented examples. This is because we ensure class balance in a batch during open representation learning, so bad augmented examples have lower weight in the loss than gold data (in-distribution training data). Nevertheless, ODIST-DB has the highest precision, which suggests that the 'select' step in distributional-shift data augmentation is helpful. One avenue for future work is to efficiently and effectively create diverse augmented data.

Conclusion
In this paper, we study the open-world classification problem. Different from existing research, we propose to learn an open representation. To achieve that goal, we propose a novel and simple method to create distributionally shifted instances from the training examples. The experimental results show that the method is effective and improves over SOTA results on three classification datasets.