Open World Classification with Adaptive Negative Samples

Open world classification is a task in natural language processing with key practical relevance and impact. Since open or unknown category data only manifests in the inference phase, finding a model with a suitable decision boundary that accommodates the identification of known classes and the discrimination of the open category is challenging. The performance of existing models is limited by the lack of effective open category data during the training stage or by the lack of a good mechanism to learn appropriate decision boundaries. We propose an approach based on Adaptive Negative Samples (ANS), designed to generate effective synthetic open category samples during training without requiring any prior knowledge or external datasets. Empirically, we find a significant advantage in using auxiliary one-versus-rest binary classifiers, which effectively utilize the generated negative samples and avoid the complex threshold-seeking stage of previous works. Extensive experiments on three benchmark datasets show that ANS achieves significant improvements over state-of-the-art methods.


Introduction
Standard supervised classification assumes that all categories expected in the testing phase have been fully observed while training, i.e., every sample is assigned to one known category as illustrated in Figure 1(a). This may not always be satisfied in practical applications, such as dialogue intention classification, where new intents are expected to emerge. Consequently, it is desirable to have a classifier capable of discriminating whether a given sample belongs to a known or an unknown category, e.g., the red samples in Figure 1(a). This problem can be understood as a (C + 1) classification problem, where C is the number of known categories and the additional category is reserved for open (unknown) samples. This scenario is also known as multi-class open set recognition (Scheirer et al., 2014).
To discriminate the known from the open samples during inference, it is necessary to create a clear classification boundary that separates the known from the open category. However, the lack of open category samples during training makes this problem challenging. Current research in this setting mainly focuses on two directions. The first direction estimates a tighter decision boundary between known classes to allow for the possibility of the open category. Existing works in this direction include the Local Outlier Factor (LOF) (Breunig et al., 2000; Zhou et al., 2022; Zeng et al., 2021), Deep Open Classification (DOC) (Shu et al., 2017) and Adaptive Decision Boundary (ADB) (Zhang et al., 2021b). LOF and ADB calibrate the decision boundary in the feature space while DOC does it in the probability space.
The second direction deals with learning a better feature representation to make the boundary-seeking problem easier. In this direction, DeepUnk (Lin and Xu, 2019a) and SEG (Yan et al., 2020) added constraints to the feature space, while SCL (Zeng et al., 2021) and Zhou et al. (2022) fine-tuned the feature extractor backbone with contrastive learning. Zhan et al. (2021) considered introducing open samples from other datasets as negatives, and Shu et al. (2021) generated samples with contradictory meanings using a large pretrained model. The latter two deliver large performance gains.
These improvements demonstrate the significance of negative samples in determining the boundary between the known and open categories. To accomplish the same in the absence of additional datasets or knowledge, we propose a novel negative-sample constraint and employ a gradient-based method to generate pseudo open category samples. As shown in Figure 1(d), negative samples are adaptively generated for each category to closely bound each category.
Given the generated negative samples, we then empirically find that using auxiliary one-versus-rest binary classifiers can better capture the boundary between the known and the open category, relative to a (C + 1)-way classifier (Zhan et al., 2021), where all the open categories, possibly distributed in multiple modes or arbitrarily scattered over the feature space, are categorized into one class.
Specifically, we first learn a C-category classifier on known category data. Then for each known category, we learn an auxiliary binary classifier, treating this category as positive and the others as negative. During inference, a sample is recognized as open if all the binary classifiers predict it as negative, thus not belonging to any known category. Our main contributions are summarized below: • We propose a novel adaptive negative-sample-generation method for open-world classification problems without the need for external data or prior knowledge of the open categories. Moreover, the negative samples can be added to existing methods and yield performance gains.
• We show that synthesized negative samples combined with auxiliary one-versus-rest binary classifiers facilitate learning better decision boundaries and require no tuning (calibration) of the open category threshold.
• We conduct extensive experiments to show that our approach significantly improves over previous state-of-the-art methods.

Related Work
Boundary Calibration The classical local outlier factor (LOF) (Breunig et al., 2000) method is a custom metric that calculates the local density deviation of a data point from its neighbors. However, there is no principled rule on how to choose the outlier detection threshold when using LOF. DOC (Shu et al., 2017) employed one-versus-rest binary classifiers and then calculated the threshold over the confidence score space by fitting it to a Gaussian distribution. This method is limited by the often inaccurate (uncalibrated) predictive confidence learned by the neural network (Guo et al., 2017). Adaptive decision boundary (ADB) (Zhang et al., 2021b), illustrated in Figure 1(c), was recently proposed to learn bounded spherical regions for known categories to contain known class samples. Though this post-processing approach achieves state-of-the-art performance, it still suffers from the issue that the tight-bounded spheres may not exist or cannot be well-defined in representation space. Since high-dimensional data representations usually lie on a low-dimensional manifold (Pless and Souvenir, 2009), a sphere defined in Euclidean space can be restrictive as a decision boundary. Moreover, the issue can be more severe if certain categories follow multimodal or skewed distributions.
Representation Learning DeepUnk (Lin and Xu, 2019a) trains the feature extractor with the Large Margin Cosine Loss (Wang et al., 2018). SEG (Yan et al., 2020) assumes that the known features follow a mixture of Gaussians. Zeng et al. (2021) and Zhou et al. (2022) applied supervised contrastive learning (Chen et al., 2020) and further improved the representation quality by using k-nearest positive samples and negative samples collected from the memory buffer of MOCO (He et al., 2020). Getting a better representation trained with known category data only is complementary to our work, since a better pretrained backbone can further improve our results. Recent works found that it may be problematic to learn features solely based on the known classes.

Known Category Classification
We first train a C-category classifier on known category data: an input x is mapped by the feature extractor f_ψenc(·) to features z, and f_ψcls(·) is a classifier that takes z as input and whose output dimension is the number of known categories C; f^j_ψcls(z) represents the output logit for the j-th category. A well-trained feature extractor f_ψenc(·) and a high-quality classifier f_ψcls(·) can extract representative features of each category and ensure good performance on the classification of known categories during the inference stage.

Open Category Recognition
Once the classifier for the known categories is available, the task is to recognize samples from the open category versus the ones from known categories. As mentioned above, directly using the known category classifier f_ψcls(·) can result in poor performance (Hendrycks and Gimpel, 2016), while using a (C + 1) category classifier setting is complicated due to the need to find proper samples to obtain a suitable decision boundary for the open category (Scheirer et al., 2012; Liang et al., 2018; Shu et al., 2021). In this work, building upon ideas from one-class classification (Schölkopf et al., 2001; Ruff et al., 2018) and one-vs-all support vector machines (Rifkin and Klautau, 2004), we propose a one-versus-rest framework via training simple binary classifiers for each predefined category. Based on this framework, we then introduce an effective open-sample generation approach to train these classifiers in Section 3.3.
We build an auxiliary one-versus-rest binary classifier for each known category; we take the m-th category as an example to illustrate. Given a text sequence x, we use the BERT model pretrained with the classification loss as the feature extractor f_ψenc(·) to extract features z ∈ R^d to be fed to the binary classifiers, where d is the dimension of the feature space. Each category is provided with a binary classifier denoted as g_θcls_m(z): R^d → R, such that if g_θcls_m(z) > 0 the input text x belongs to the m-th category, and otherwise to any of the other categories. We parameterize the entire binary classification framework for the m-th category as θ_m = (ψ_enc, θ_cls_m). To learn each binary classifier g_θcls_m(·) from training data D, we first construct a positive set {x_1, x_2, ..., x_Nm} using data points with label l_m from D and a negative set {x̃_1, x̃_2, ..., x̃_{N−Nm}} using data points from D not in category l_m. The total number of samples within category m is N_m, and N − N_m is the number of remaining samples in D. Each binary classifier is optimized by minimizing the binary cross-entropy loss function L_rest. During the inference phase, a sample is predicted as the m-th category when g_θcls_m(z) > 0.
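As a minimal sketch of this recipe (not the paper's actual implementation), the helper below trains one binary head on frozen features with the binary cross-entropy loss L_rest, using a logistic-regression stand-in for the neural head g_m; all names are our own illustration:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_ovr_head(Z, y, lr=0.1, epochs=200):
    """Train one one-versus-rest head: positives are samples of category m,
    negatives are the remaining known-category samples. Minimizes binary
    cross-entropy by plain gradient descent on a linear logit w^T z + b."""
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(Z @ w + b)
        grad = p - y                      # dBCE/dlogit for each sample
        w -= lr * (Z.T @ grad) / n
        b -= lr * grad.mean()
    return w, b
```

In the full method one such head is trained per known category, all sharing the frozen BERT feature extractor.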

Adaptive Negative Sampling (ANS)
In practice, it is problematic to learn a good binary classifier with only the aforementioned negative samples from D. The sample space of the open category is complicated, considering that new-category samples could originate from different topics and sources, relative to the known classes. So motivated, some existing methods introduce additional external data as negative samples.
To alleviate the issue associated with the lack of real negative samples, we propose to synthesize negative samples x̃. Considering that it is hard to create actual open-category text examples, we choose to draw virtual text examples z̃ in the feature space (Miyato et al., 2016; Zhu et al., 2019). Compared with the token space of text, the feature space is typically smoother (Bengio et al., 2013), which makes it convenient to calculate gradients (Wang et al., 2021).
For all known samples in a category l_m, points that lie away from these samples can be recognized as negative by the classifier g_θcls_m(·). The entire feature space R^d contains essentially an uncountable set of such points, among which we are mostly interested in those near the known samples. Consequently, capturing these points will be helpful to characterize the decision boundary.
To give a mathematical description of these points, we assume that data usually lies on a low-dimensional manifold in a high-dimensional space (Pless and Souvenir, 2009). The low-dimensional manifold can be viewed as a locally Euclidean space, thus we can use the Euclidean metric to measure distances locally for each known sample z_i. Under this assumption, the set of pseudo negative samples N_i(r) for z_i, which we call the adaptive synthesized open set, can be described as

N_i(r) = { z̃ ∈ R^d : ‖z̃ − z_j‖_2 ≥ r for all z_j in category l_m, and ‖z̃ − z_i‖_2 ≤ γ·r },

where r is the distance radius and γ > 1 is a hyperparameter. Note that each known sample z_i has an associated adaptive synthesized open set. As defined above, this set is subject to two inequalities. The first keeps synthesized samples away from all known samples within category m. The second implies that the synthesized samples should not be too far from the known samples. An intuitive geometric interpretation is that when j = i, the space implied by these two constraints is a spherical shell with inner radius r and outer radius γ·r.
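The two inequalities defining N_i(r) can be checked mechanically; the helper below is our own illustration (function and argument names are assumptions, not the paper's code):

```python
import numpy as np

def in_adaptive_open_set(z_tilde, category_feats, i, r, gamma=2.0):
    """Check membership of a candidate z_tilde in N_i(r): it must lie at
    least r away from every known sample of the category, and at most
    gamma * r away from the anchor sample z_i = category_feats[i]."""
    dists = np.linalg.norm(category_feats - z_tilde, axis=1)
    far_from_all = np.all(dists >= r)                                  # first inequality
    near_anchor = np.linalg.norm(category_feats[i] - z_tilde) <= gamma * r  # second
    return bool(far_from_all and near_anchor)
```

With a single known sample this reduces to the spherical-shell picture described above: the candidate must sit between radius r and radius γ·r around the anchor.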
To get the radius r, we first calculate the covariance matrix Σ of z using known samples from category m, and choose r s.t. r ≤ √(2 Tr(Σ)) and γr ≥ √(2 Tr(Σ)). This is under the consideration that √(2 Tr(Σ)) upper-bounds the average Euclidean distance between random samples drawn from a distribution with covariance matrix Σ.
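This heuristic amounts to one line of linear algebra; a minimal sketch (the function name is ours):

```python
import numpy as np

def estimate_radius(feats):
    """Upper bound on the mean pairwise Euclidean distance between samples
    drawn from the feature distribution: sqrt(2 * tr(Sigma))."""
    sigma = np.cov(feats, rowvar=False)   # empirical covariance of the features
    return float(np.sqrt(2.0 * np.trace(sigma)))
```

By Proposition 1, the value returned is an upper bound on the empirical mean pairwise distance, so picking r at or below it keeps the inner shell radius within the typical spread of a category.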
The estimation is supported by the following proposition.

Proposition 1 The expectation of the Euclidean distance between random points sampled from a distribution with covariance matrix Σ is smaller than √(2 Tr(Σ)), i.e., E‖x − y‖_2 ≤ √(2 Tr(Σ)).
The proof can be found in the supplementary material. In our experiments, we fix γ = 2 and r = 8. The choice of r is relevant to the covariance matrix of the features in representation space. The detailed justification for our selection is provided in Appendix A.3. Ablation studies (Figure 3) show that model performance is not very sensitive to the chosen r.
Binary Classification with ANS According to Equation 3, each sample from a known category m contributes an adaptive synthesized set of open samples N_i(r). The classifier g_θcls_m(·) is expected to discriminate them as negative. The corresponding objective function is the binary cross-entropy loss, where z̃_i is sampled from N_i(r) and N_m is the total number of known samples with category l_m. However, there exist uncountably many points in N_i(r).
Randomly sampling one example from N_i(r) is not effective. Instead, we choose the most representative points, i.e., those that are hard for the classifier to classify as negative. Consistent with this intuition, a max(·) operator is added to select the most challenging synthetic open sample distributed in N_i(r).
Finally, the complete loss accounting for open recognition combines the real-sample loss and the synthesized-sample loss, where λ is a regularization hyperparameter weighting the latter.

Algorithm 1 (summary): for each batch B, calculate the loss L1 = L_real(X_B, X̃_B) using Eq. 2; calculate the features z_B of the positive samples; perform adaptive negative sampling by calculating the loss ℓ(z_B + ϵ) using Eq. 5 and ascending its gradient; then calculate α using Eq. 7 to project the perturbed points back into the constraint set.

Directly minimizing the objective function in Equation 6 subject to the constraint in Equation 3 is challenging. In the experiments, we adopt the projected gradient descent-ascent technique (Goyal et al., 2020) to solve this problem.
Projected Gradient Descent-Ascent We use gradient descent to minimize the open recognition loss L_open and gradient ascent to find the hardest synthetic negative samples z̃. The detailed steps are summarized in Algorithm 1. As illustrated in Figure 2(c), the sample z̃′_i = z̃_i + ϵ directly derived from gradient ascent (line 11 of Algorithm 1) might fall outside the constraint area N_i(r). We then project to the closest z̃_i within the constraint, such that z̃_i = argmin_u ‖z̃′_i − u‖_2, u ∈ N_i(r) (Boyd et al., 2004). Unfortunately, a direct search within N_i(r) defined in Equation 3 requires complex computation over the entire training data D. Based on our assumption that the training samples lie on a low-dimensional manifold and the empirical observation that z̃′_i is always closest to the corresponding positive point z_i relative to other positive points, N_i(r) can be further relaxed to the spherical shell around sample z_i:

Ñ_i(r) = { z̃ ∈ R^d : r ≤ ‖z̃ − z_i‖_2 ≤ γ·r }.

We can then directly find the synthetic negative sample via a projection along the radius direction, i.e., z̃_i = z_i + α (z̃′_i − z_i), where α is adjusted to guarantee z̃_i ∈ Ñ_i(r). In the future, we would like to consider relaxing these constraints by only considering the nearest k points instead of all the points within a category.
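A hedged sketch of the relaxed ascent-plus-projection loop, under the assumptions above: we start from a noisy copy of z_i, ascend an arbitrary gradient function (standing in for the classifier's gradient), and radially clamp the result into the shell [r, γr]. The function and its arguments are our illustration, not the paper's code:

```python
import numpy as np

def hardest_negative(z_i, grad_fn, r, gamma=2.0, steps=5, step_size=1.0, seed=0):
    """Gradient ascent on grad_fn to find a hard synthetic negative, with a
    radial projection after each step so that r <= ||z_tilde - z_i|| <= gamma*r."""
    rng = np.random.default_rng(seed)
    z_tilde = z_i + rng.normal(scale=r, size=z_i.shape)   # noisy start (line 8)
    for _ in range(steps):
        z_tilde = z_tilde + step_size * grad_fn(z_tilde)  # gradient ascent
        delta = z_tilde - z_i
        dist = np.linalg.norm(delta)
        alpha = np.clip(dist, r, gamma * r) / max(dist, 1e-12)
        z_tilde = z_i + alpha * delta                     # projection along the radius
    return z_tilde
```

The scalar alpha plays the role of the α in the projection formula: it rescales the offset from z_i so the final point lands inside the shell without searching over the whole dataset.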

Experiments
We conduct experiments on three datasets: Banking, CLINC and Stackoverflow. Details and examples of the datasets are found in Appendix A.4. The baselines include threshold-finding methods, MSP (Hendrycks and Gimpel, 2016), DOC (Shu et al., 2017), OpenMax (Bendale and Boult, 2015), ADB (Zhang et al., 2021b); feature learning methods, DeepUnk (Lin and Xu, 2019a); and the negative data generation method SelfSup (Zhan et al., 2021). We did not include results from ODIST (Shu et al., 2021) because their method relies on MNLI-pretrained BART, which is not currently public, and their model performance drops dramatically if not coupled with ADB. Note that SelfSup uses additional datasets, without which its accuracy on 50% CLINC drops from 88.33 to 83.12.
Our approach performs better than most previous methods, even better than the method using additional datasets, with the greatest improvement on CLINC. This is in accordance with SelfSup (Zhan et al., 2021), which also benefits the most on CLINC by adding negative samples. This implies that our synthesized negative samples are of high quality and could possibly be used as extra data in other methods.
The average performance gain on these three datasets decreases as the known category ratio increases; i.e., compared to the strongest baseline ADB, our accuracy improvements on the three datasets are 5.08, 3.11, and 1.42 under the settings of 25%, 50%, and 75%, respectively. With more known categories available, the known negative samples become more diverse, allowing the model to better capture the boundaries of the positive known categories while reducing the impact of synthetic samples.
The comparison with baselines on F1-open and F1-known can be found in Appendix A.3.

Discussion
Synthesized Negative is Beneficial for a Variety of Structures To investigate the contribution of the synthesized samples and the structure of one-versus-rest, we performed experiments adding the synthesized samples to two well-known baselines, MSP (Hendrycks and Gimpel, 2016) and ADB (Zhang et al., 2021b), as shown in Table 3. Specifically, the C-way classifier in MSP and ADB is replaced by a (C + 1)-way classifier, with an extra head for the synthesized negative samples. See Appendix A.7 for details.
We observe that performance increases on all baselines with synthesized negative samples. The synthesized samples behave like data augmentation, leading to better representations of positive samples.
Further, synthesized samples benefit one-versus-rest the most. The difference, we believe, stems from the model's flexibility in boundary learning. The open category may contain sentences with various themes, making it difficult for a single head of a (C + 1)-way classifier to catch them all. This one-versus-rest flexibility comes at the cost of more classifier parameters. However, compared to the huge feature extractor BERT, the number of additional parameters is relatively small. Distillation techniques can be used to build a smaller model if necessary, for instance, when there are thousands of known categories.
Adaptive Negative Sample Generation Our adaptive negative sample generation consists of three modules: (a) adding Gaussian noise to the original samples (line 8 in Algorithm 1); (b) gradient ascent (lines 10∼11); (c) projection (lines 13∼14). We add each module to the baseline in turn to study their importance. The baseline experiment uses the vanilla one-versus-rest framework described in Section 3.2, without the use of synthesized negative samples. Experiments are conducted on CLINC, as shown in Table 2.
The following describes our findings from each experiment: (i) Adding samples with noise as negatives alleviates the overconfidence problem of the classifier and improves the results significantly. The noise level needs to be designed carefully, since small noise blurs the distinction between known and open, while large noise is ineffective.
(ii) Constraining synthesized samples to N_i(r) improves performance by keeping synthesized samples from being too close or too far away from positive known samples.
(iii) Adding a gradient ascent step further enhances performance. The improvement over the previous step is marginal. Our hypothesis is that the calculated gradient could be noisy, since the noise we add is isotropic and may be inconsistent with (outside of) the manifold of the original data.
Radius r Analysis In the adaptive negative sample generation process, the radius r and multiplier γ are two hyperparameters that determine the upper and lower bounds of the distance between the synthesized sample and the known sample. To investigate the impact of the radius, we fix γ to 2 and increase r from 1 to 256. Note that 8 is our default setting.
As illustrated in Figure 3, the performance gradually drops when the radius r increases, because the influence of the synthesized negative examples reduces as the distance between them and the positive samples grows. When the radius r decreases, the classifier may be more likely to incorrectly categorize positives as negatives because the synthesized negative samples are closer to the known positives, resulting in a decrease in accuracy and F1 score on the Banking and CLINC datasets. However, the performance on Stackoverflow improves. We hypothesize that there is a better data-adaptive way to estimate the radius r to improve the performance even further, for example, using k nearest neighbors instead of all the data in a category. We leave this as an interesting future direction.
In summary, we observe that the performance is affected by the radius, but comparable results can be obtained for a wide value range. They are all better than the vanilla one-versus-rest baseline, which lacks the generated negative samples. The accuracy of the baselines on Banking, CLINC and Stackoverflow is 58.09, 71.80 and 64.58, respectively.
Visualization Figure 4 shows the t-SNE representation of the features extracted from the second hidden layer of the one-versus-rest classifier g_θcls_m. Three randomly chosen known categories, each corresponding to a one-versus-rest classifier, yield the three panels. The known positive/negative samples (blue) are clustered because the features are extracted from the C-way classifier pretrained in Section 3.1. Open samples (pink) are scattered, and some of them overlap with the known positives (see the middle panel). Our synthesized negatives work as expected; they are close to the known positives.

A.1 Ethical Consideration
The topics of the three datasets we use in this paper are relatively simple, covering only the information needed for classification (see Table A.1). The category labels are either everyday intentions or technical terms in computer science. There are no potentially sensitive topics or contents that we are aware of. All three datasets have been published and are included in our appendix.

A.2 Related work: Adversary Augmentation
The gradient descent-ascent technique has been used successfully in adversarial attacks (Madry et al., 2018; Zeng et al., 2021); however, it differs from ours in terms of motivation and loss formulation. Madry et al. (2018) sought samples that are similar to the training sample but incur a substantial loss given the paired label. The addition of generated samples during training may strengthen the model's resistance to adversarial attacks by avoiding inputs that are indistinguishable from genuine data but improperly categorized. The associated optimization formula is

min_θ E_{(x,y)} [ max_{δ∈S} ℓ(x + δ, y; θ) ],

where (x, y) are the training data, ℓ could be any classification loss for a model parameterized by θ, and S is an adversarial perturbation l∞ ball.
Our work targets shrinking the decision boundary. We need to treat samples in a specific region N_i(r) (defined in Equation 3) around the positive samples as negative, i.e.,

min_δ E_{z∼D} [ max_{z̃∈N_i(r)} ℓ(z̃, 0; δ) ],

where z is the positive sample from dataset D and ℓ is the loss of a binary classifier with parameters δ. This behaves the same as Equation 5.

A.3 Explanation on Radius r
Proposition 2 The expectation of the Euclidean distance between random points sampled from a distribution with covariance matrix Σ is smaller than √(2 Tr(Σ)), i.e., E‖x − y‖_2 ≤ √(2 Tr(Σ)).
Proof: Given that we are measuring the distance between samples drawn from the same distribution, we can subtract a constant from both variables and assume that the distribution's expectation is zero. If x and y are random variables independently sampled from a distribution with covariance matrix Σ and zero mean, then E‖x − y‖² = E‖x‖² + E‖y‖² − 2 E⟨x, y⟩ = 2 Tr(Σ). For a random variable Z, Jensen's inequality gives (E|Z|)² ≤ E Z². Combining the two equations above proves the proposition: E‖x − y‖ ≤ √(E‖x − y‖²) = √(2 Tr(Σ)). In experiments, we choose the mean of the last layer of BERT as the latent representation z ∈ R^768. When calculating the trace, only the variance of each dimension is required. On the three datasets, the estimated distance per category falls primarily into [8, 12], so we fix r = 8 for all the experiments.
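The two steps of the argument can be written compactly as:

```latex
\mathbb{E}\,\|x-y\|_2^2
  = \mathbb{E}\,\|x\|_2^2 + \mathbb{E}\,\|y\|_2^2 - 2\,\mathbb{E}\,\langle x, y\rangle
  = 2\,\mathrm{Tr}(\Sigma),
\qquad
\mathbb{E}\,\|x-y\|_2 \le \sqrt{\mathbb{E}\,\|x-y\|_2^2} = \sqrt{2\,\mathrm{Tr}(\Sigma)},
```

where the cross term vanishes by independence and zero mean, and the inequality is Jensen's applied to the concave square root.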
For each positive point, we could sample several adaptive negative samples. The distance between the synthesized negative and the chosen positive is determined by r. Meanwhile, we can also calculate the distance between the synthesized negative and the other positive known samples.
We find that even when the radius is set to be less than the average distance, the synthesized negative samples have a much greater distance to the other known points. In theory, known samples lie on a low-dimensional manifold, whereas synthesized points live in the ambient high-dimensional space, and the probability of sampled points falling onto the manifold is zero. We calculated the distance between the synthesized sample and other known samples in the same category empirically, and discovered that their distances are nearly twice the average distance.
A further ablation study on different choices of the radius can be found in the main text, where we compare different radii, i.e., r ∈ {1, 4, 8, 16, 64}. Training on this part takes about 10-20 minutes, depending on when early stopping is triggered.

One-versus-rest Binary Classification
During the training of one-versus-rest structure, we fix the parameters of the feature extractor ψ enc .
For the one-versus-rest module, the feature z is chosen as the mean of the output of the BERT model's last layer. The classifier is a fully connected three-layer neural network with ReLU as the activation function. The numbers of hidden neurons are (256, 64), respectively. Dropout is added per hidden layer. The learning rate of the classifier is 1e−3 for Banking and CLINC, and 3e−4 for Stackoverflow. The total number of epochs for each classifier is min(C, 20) to avoid overfitting. γ is 0.5.
Currently, we train each one-versus-rest classifier individually, and this takes roughly a minute per head.As a result, the total time grows linearly with the number of known categories.Parallel training of multiple heads can increase efficiency if necessary.

Model Size
The parameters of the BERT backbone model and the ovr classifier are approximately 109 million and 0.2 million, respectively, implying that the maximum total number of parameters from one-versus-rest heads is only about one-fifth of BERT (75% CLINC).
Reproducibility Checklist: hyper-parameter search We did not include results from the validation set considering there is a huge gap between the current validation set and the test set; the test set contains open category samples while the validation set does not.
It is difficult to study the hyper-parameter setting because we lack an effective validation set with open category samples and the test set is unavailable during training. To solve this, we construct a "pseudo dataset" by selecting a subset of all known categories as "sub-known" and treating the others as "sub-open". Taking 50% CLINC as an example, we take a quarter of the known categories as "sub-known" and the others as "sub-open". We discover that the rules we developed using these synthesized datasets transfer to the formal experiments. We choose the proper hyper-parameters according to F1.
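A sketch of this pseudo-dataset construction (names and the category-level split are our own illustration):

```python
import random

def make_pseudo_split(known_labels, sub_known_frac=0.25, seed=0):
    """Split the known categories into 'sub-known' and 'sub-open' so that a
    validation set containing (pseudo) open samples can be simulated when
    tuning hyper-parameters."""
    labels = sorted(set(known_labels))
    rng = random.Random(seed)
    rng.shuffle(labels)
    k = max(1, int(len(labels) * sub_known_frac))
    sub_known = set(labels[:k])
    sub_open = set(labels[k:])   # treated as open during the pseudo evaluation
    return sub_known, sub_open
```

Samples whose label lands in sub_open are relabeled as the open category for the pseudo validation run, mimicking the real test-time condition.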
The hyperparameters we manually tried include the number of training epochs (10, 20) and the learning rate (1e−3, 1e−4 for the classifier head; 1e−4, 5e−5, 1e−5 for BERT; note that the learning rate of the classifier head is always larger than that of BERT). The ablation study and the analysis of the radius r can be found in Figure 3 and Appendix A.3. The final results are not sensitive to the hyper-parameters of the gradient ascent step.
A.6 More results

F1-known and F1-open
The definitions of F1-known and F1-open can be found in Section 4. Table A.3 shows the comparisons between the baselines and ours.
Reproducibility Checklist: Differences on Datasets During this process, we found that dataset CLINC is the most robust to changes of hyperparameters while dataset Stackoverflow is the least. A similar observation also appears in Zeng et al. (2021): the method works best on CLINC and worst on Stackoverflow.
We hypothesize that the difference comes from the quality of the raw data. As shown in Table A.1, the category of the input in Stackoverflow is usually included in the original sentence, and we name such sentences "easy". A few sentences do not follow this rule, and we call them "hard". This leads to the empirical observation that the number of training epochs should be controlled in a limited range; otherwise many open category samples would be wrongly categorized into the known categories.
This is consistent with findings in noisy-label classification: a neural network will first fit the clean labels before overfitting the noisily labeled samples. Under the current setting, the "hard" sentences correspond to the noisy samples. The one-versus-rest classifier will first fit the easier ones, followed by the harder ones. When the harder ones are classified correctly, many open category samples can also be classified into that known category.
ADB (Zhang et al., 2021b) avoids this problem by working directly on the pre-trained features. It can statistically filter out the influence of noisy samples. Though ADB does not require extra hyperparameter tuning, we found that the position of the features extracted from the model has an impact on the final performance.
In summary, the differences between datasets are an intriguing topic that merits further investigation in the future.

Reproducibility Checklist: Standard Deviation
As shown in Table A.4, larger known category ratios are more likely to be associated with lower variance; this is to be expected because more samples make the training more stable. Note that our MSP with synthesized negatives differs from Zhan et al. (2021) in two aspects: (i) different ways to choose the negative samples; (ii) their work added synthesized negative samples to the validation set, while ours uses the original validation set.
ADB with negative sampling ADB training has two steps. The first step is to learn a good feature extractor using a C-way classifier. The second step is to learn the boundary of each category in the pre-trained feature space.
Our modification is in the first step. We replace the original classifier with a (C + 1)-way classifier. The extra head is designed for the synthesized negative samples. The inference step is kept the same as in the original method.

Figure 1: Illustration of previous methods and our proposed one. C0, C1, and C2 are three categories. The boundary is used to discriminate known and open (unknown) categories. (a) Boundary learned by supervised learning. (b) Optimal decision boundary. (c) ADB method, which has a closed decision boundary but may capture irrelevant points. (d) Proposed ANS method.
We assign a testing example to the open category l_0 if it is detected as unknown by all C binary classifiers. Otherwise, we pass the example to the known-category classifier to categorize it among the known categories. The entire inference pipeline is illustrated in Figure 2(b).
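The inference rule above can be sketched as follows (our illustration; the heads and C-way classifier are passed in as plain callables):

```python
import numpy as np

def predict_with_open(z, ovr_heads, cway_predict, open_label=0):
    """A sample is assigned to the open category l_0 iff every binary
    one-versus-rest head scores it non-positive; otherwise the C-way
    classifier decides among the known categories."""
    scores = np.array([g(z) for g in ovr_heads])
    if np.all(scores <= 0):
        return open_label
    return cway_predict(z)
```

Note that no threshold calibration is needed here: the decision is read directly off the sign of each head's logit.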

Figure 3: Ablation study on the radius r used in the adaptive negative sampling process under the setting with 50% known categories.

Figure 4: t-SNE plots of the features extracted from the testing set of CLINC with 50% known categories. Each panel corresponds to a one-versus-rest classifier g_θcls_m with a different known category m acting as the positive. Each panel's lower right corner has a square that enlarges the known positive data region to show the effectiveness of the synthesized negative samples. (Best viewed in color.)

Table 1: Results of open world classification on three datasets with different known class proportions. * indicates the use of extra datasets during training. The first five results of the baseline models are from Zhang et al. (2021b). The results for SelfSup are from Zhan et al. (2021).
We follow exactly the same settings as Shu et al. (2017) and Lin and Xu (2019a). For each dataset, we randomly sample 25%, 50%, or 75% of the categories and treat them as the known categories. Any other categories are grouped into the open category. In the training and validation sets, only samples within the known categories are kept. All samples in the testing set are retained, and the label of samples belonging to open categories is set to l_0. Importantly, samples from the open category are never exposed to the model in the training and validation process.
The model needs to identify the samples of the open category, as well as classify the known samples correctly. Following Shu et al. (2017); Zhang et al. (2021b), we use accuracy and macro F1 as our evaluation metrics. 4.1 Results Table 1 compares our approach with previous state-of-the-art methods using accuracy and F1. Our implementation is based on Zhang et al. (2021a).

Table A.1: Extracted samples from the three main datasets.