One-class Text Classification with Multi-modal Deep Support Vector Data Description

This work presents multi-modal deep SVDD (mSVDD) for one-class text classification. By extending uni-modal SVDD to a multi-modal one, we build mSVDD with multiple hyperspheres, which enables us to build a much better description of the target one-class data. Additionally, the end-to-end architecture of mSVDD can jointly handle neural feature learning and one-class text learning. We also introduce a mechanism for incorporating negative supervision in the absence of real negative data, which can be beneficial to the mSVDD model. We conduct experiments on the Reuters and 20 Newsgroups datasets, and the experimental results demonstrate that mSVDD outperforms uni-modal SVDD and that mSVDD obtains further improvements when negative supervision is incorporated.


Introduction
One-Class Classification (OCC), a special classification problem, aims to learn a model on the basis of training samples from only one class. The learned model is expected to make an accurate description of that class (the so-called target or normal class) and then to distinguish the target from samples of negative classes during testing (Moya et al., 1993; Tax, 2002). The one-class classification problem arises in many real-world applications, including anomaly or novelty detection (Roberts, 1999; Chandola et al., 2010; Gupta et al., 2013), bioinformatics (Alashwal et al., 2006), and especially computer vision (Rodner et al., 2011; Ruff et al., 2018).
One-class text classification is beneficial in scenarios where anomalous text contents (e.g., web pages, spam emails) (Yu et al., 2004) need to be detected and only a positive training corpus is available. One of the early works on one-class text classification is Manevitz and Yousef (2001), who implemented versions of one-class support vector machines (OC-SVM) (Schölkopf et al., 2001) and showed good performance on the Reuters dataset (Dumais et al., 1998). OC-SVM and support vector data description (SVDD) (Tax and Duin, 2004) are boundary-based methods (Tax, 2002). Both try to describe the target data using a boundary. SVDD learns an optimal hypersphere with the minimum radius that includes most of the target data, while OC-SVM builds a hyperplane to maximally separate the data points from the origin, around which outlier examples lie.
Reconstruction-based approaches, including AutoEncoders (Jacobs, 1995) and principal component analysis (PCA) (Bishop, 1995), aim to learn a more compact representation for the description of target data. The compact representation could be a set of prototypes or subspaces obtained by optimizing a reconstruction error on the target training data.
Regarding the features for representing text in OCC, document-to-word co-occurrence matrices or hand-crafted features have been commonly used in most of the previous work (Manevitz and Yousef, 2001, 2007; Kumaraswamy et al., 2015). Pretrained vectors have become popular for many NLP tasks (Mikolov et al., 2013; Bengio et al., 2003). The recent context vector data description (CVDD), proposed by Ruff et al. (2019), fully uses word embedding knowledge and a neural network structure for one-class classification problems. Ruff et al. (2018) introduced deep support vector data description (deep SVDD), a fully unsupervised method for deep one-class classification of image data. Deep SVDD learns to extract the common factors of target training samples with a neural network by minimizing the radius of a hypersphere that encloses the network representations of the data. The learned hypersphere, with a center c and a neural feature transformer φ(x), constitutes an end-to-end feature learning and one-class classification model.
Target data samples may have distinctive distributions located in different regions. Therefore, uni-modal deep SVDD with one hypersphere may not be enough to describe the target samples. In this work, we extend deep SVDD to multiple modes, where each mode describes the target samples from a distinctive aspect. Given our multi-modal deep SVDD, mSVDD in short, we can create an ensemble of hyperspheres with different centers to build a better one-class model. Ghafoori and Leckie (2020) proposed deep multi-sphere SVDD (DMSVDD), work that is similar to but distinct from ours. We discuss the relationship between the two and compare them in the experiments.
In one-class classification, only samples from the target class are available for training, while the model needs to discriminate between the target class and other classes in testing. Due to the unavailability of training samples from negative classes, it is hard for the one-class models to learn effective discrimination information, especially for mSVDD with a multi-layer neural structure. In this study, we also propose an architecture for improving the discrimination ability of mSVDD by incorporating negative supervision. Specifically, we define two kinds of losses, contrastive and triplet, for joint training with the objective function of mSVDD, which is expected to enhance the discriminative power of mSVDD.
In summary, the main contributions of this work are as follows. 1) We propose a general one-class neural learning framework, called mSVDD, that extends uni-modal deep SVDD to an end-to-end multi-modal model. 2) We prove that three one-class models, deep SVDD, DMSVDD, and CVDD, are all special cases of the mSVDD model. 3) We propose two approaches for effectively incorporating negative supervision information to improve the performance of the proposed mSVDD.

SVDD
SVDD is a support vector learning method for one-class classification. It aims at constructing an optimal boundary in a feature space that includes almost all normal target data, given only the target training samples T = {x_1, ..., x_n}, x_i ∈ X, where n ∈ N is the size of the training data and X is a compact subset of R^d. The main idea of SVDD is to optimize a hypersphere with a center c and radius R that encloses the majority of the data. SVDD solves the following quadratic problem:

min_{R, c, ξ} R^2 + C Σ_i ξ_i
s.t. ‖x_i − c‖_2^2 ≤ R^2 + ξ_i, ξ_i ≥ 0, ∀i,

where ξ_i is a slack variable allowing a flexible boundary. C is a regularization parameter, usually written as 1/(νn), where ν ∈ (0, 1] controls the tradeoff between the radius of the hypersphere and the penalties ξ_i. Several efforts have been made to extend SVDD with multiple spheres. Hao and Lin (2007) was early work using multi-sphere SVDD, applied to multi-class tasks. For one-class tasks, Xiao et al. (2009) used multi-sphere SVDD to encode multi-distribution target data. Two further efforts were proposed by Le et al. (2010, 2013), which found the optimal solution by an iterative algorithm consisting of the following two steps: 1) calculate radii and centers, and 2) calculate the assignments of data to centers.
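The two-step iteration described above can be sketched in a few lines of numpy. This is a toy version, not the authors' implementation: it uses farthest-point initialization and a (1 − ν)-quantile as a soft radius, and the function name is purely illustrative.

```python
import numpy as np

def multi_sphere_svdd(X, m=2, n_iter=10, nu=0.1):
    """Toy two-step multi-sphere SVDD: alternate between
    (1) assigning each point to its nearest center and
    (2) recomputing centers and radii from the assignments."""
    # farthest-point initialization of the m centers
    idx = [0]
    for _ in range(m - 1):
        d2 = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
        idx.append(int(d2.min(axis=1).argmax()))
    centers = X[idx].astype(float)
    radii = np.zeros(m)
    for _ in range(n_iter):
        # step 2: assign points to their nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # step 1: update centers; the (1 - nu)-quantile radius lets a
        # nu-fraction of each cluster fall outside (soft boundary)
        for j in range(m):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
                radii[j] = np.quantile(((pts - centers[j]) ** 2).sum(-1), 1 - nu)
    return centers, radii, assign
```

On well-separated clusters the iteration behaves like k-means with an added per-cluster radius, which is exactly the assign-then-update structure of the algorithms cited above.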
One limitation of SVDD and its extensions is that they require hand-crafted feature engineering (Pal and Foody, 2010); this limitation can be addressed by incorporating neural models into SVDD.

Deep SVDD
Deep SVDD (Ruff et al., 2018) is an end-to-end deep neural model that not only optimizes the SVDD objective loss but also learns a neural feature transformation. Given target training samples T = {x_1, ..., x_n}, deep SVDD first transforms instance x into a data point of the output feature space with φ, which is a multi-layer neural network of L ∈ N layers with parameters W = {W^1, ..., W^L}. Deep SVDD defines two kinds of loss functions.

Soft-boundary deep SVDD:

min_{R, W} R^2 + (1/(νn)) Σ_i [‖φ(x_i; W) − c‖_2^2 − R^2]_+ + (λ/2) Σ_l ‖W^l‖_F^2,

where λ > 0 is a weight decay hyperparameter.

[Figure 1: mSVDD with two modes. φ is a neural network. Stars denote centers, black points denote positive target samples, and triangles denote negative outliers that need to be rejected by the hyperspheres.]
The first penalty term is for samples lying outside the sphere, i.e., when the distance of x_i to the center, ‖φ(x_i; W) − c‖_2^2, is greater than radius R after the transformation by network φ. The loss also regularizes the radius and the neural weight parameters. As with SVDD, parameter ν ∈ (0, 1] adjusts the tradeoff between the radius of the hypersphere and the points outside it. Schölkopf et al. (2001) proved that, in single-class classification, ν is an upper bound on the fraction of anomalies and a lower bound on the fraction of training samples being anomalies or lying on the optimal boundary. Ruff et al. (2018) proved that this ν-property still holds for uni-modal soft-boundary deep SVDD.
Another simplified objective, the one-class form, minimizes the mean distance of all positive training samples to the center.

One-class deep SVDD (simplified form):

min_W (1/n) Σ_i ‖φ(x_i; W) − c‖_2^2 + (λ/2) Σ_l ‖W^l‖_F^2.

We can rewrite both of the above in a unified form:

min_{R, W} β + C Σ_i [‖φ(x_i; W) − c‖_2^2 − β]_+ + (λ/2) Σ_l ‖W^l‖_F^2,

where [·]_+ = max{0, ·}, and β ∈ {0, R^2} and regularization parameter C ∈ {1/n, 1/(νn)} correspond to the two forms.
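As a concrete illustration of the unified form, the following sketch evaluates it with a linear map standing in for the network φ; the linear feature is a stand-in for illustration, not the paper's BiLSTM encoder.

```python
import numpy as np

def deep_svdd_loss(X, W, c, beta=0.0, C=None, lam=1e-3):
    """Unified deep SVDD loss: beta + C * sum_i [||phi(x_i) - c||^2 - beta]_+
    plus weight decay on W. beta=0 with C=1/n gives the one-class form;
    beta=R^2 with C=1/(nu*n) gives the soft-boundary form."""
    n = len(X)
    C = 1.0 / n if C is None else C
    phi = X @ W                            # toy linear stand-in for phi(x; W)
    dist = ((phi - c) ** 2).sum(axis=1)    # squared distance to the center
    return beta + C * np.maximum(dist - beta, 0.0).sum() + 0.5 * lam * (W ** 2).sum()
```

Setting `beta=0` reduces the hinge to the plain distance, recovering the one-class form as a special case of the soft-boundary form.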

Multi-Modal Deep SVDD
In this section, we present our mSVDD, a method for deep one-class classification. Unlike a uni-modal model with a single hypersphere, mSVDD uses a set of hyperspheres to describe the target class data and to reject samples from negative classes. Figure 1 shows the general idea of mSVDD with two modes. Suppose we have m modes, each described by a hypersphere M_j with center c_j and radius R_j; mSVDD uses each M_j to describe a distinctive aspect of the target class and then ensembles them. This ensembled deep mSVDD model can provide a better description of the target data.
As with deep SVDD, given target training samples T = {x_1, ..., x_n}, mSVDD first transforms instance x into a data point of the output feature space with φ, where φ is a deep neural network of L ∈ N layers with parameters W = {W^1, ..., W^L}. In contrast to deep SVDD, mSVDD uses m hyperspheres to include almost all of the target data with the minimum radii, i.e., minimizing (1/m) Σ_j R_j^2. As in kernel SVDD and deep SVDD, it should also punish points lying outside the spheres, i.e., when the distance of x to a center c_j, ‖φ(x; W) − c_j‖_2^2, is greater than radius R_j. Since we have a set of hyperspheres M = {M_1, ..., M_m}, one choice would be to punish x with respect to each hypersphere by adding the term [‖φ(x; W) − c_j‖_2^2 − R_j^2]_+ for every j to the loss function. However, this penalty is very hard, requiring each sample to satisfy every j-th constraint corresponding to M_j. Therefore, we loosen the constraint. Given a non-negative attention weight α_ij of x_i for each M_j, the penalty term can be computed as the weighted average over the m constraints. Now, only one ensembled constraint is required, i.e., a sample is penalized only when its weighted sum of distances to the centers exceeds the weighted sum of squared radii. Formally, we define our mSVDD objective as follows:

min_{R, W} (1/m) Σ_j R_j^2 + (1/(νn)) Σ_i [ Σ_j α_ij (‖φ(x_i; W) − c_j‖_2^2 − R_j^2) ]_+ + (λ/2) Σ_l ‖W^l‖_F^2,    (5)

where (1/m) Σ_j R_j^2 is the regularization term over the radii of all m hyperspheres, yielding a tighter boundary around the target data. This form can be seen as mSVDD with weighted soft-boundary constraints, which we call soft-boundary mSVDD.
Although the ν-property, mentioned in Section 2.2, does not hold for our multi-modal case in general, it is still true when the attention weight α_ij is constant across hyperspheres. This gives us an intuition about the role of ν.
Proposition 1. The ν-property holds if we set an equal attention weight for each hypersphere: i) ν is an upper bound on the fraction of outlier samples, and ii) ν is a lower bound on the fraction of training samples being rejected or lying on the optimal boundary.

One-class mSVDD (simplified form)
As in deep SVDD, we also have a simplified form, called one-class mSVDD. If we assume that the majority of the training data is not anomalous, the radius can be ignored and we can define the simplified mSVDD as follows:

min_W (1/n) Σ_i Σ_j α_ij ‖φ(x_i; W) − c_j‖_2^2 + (λ/2) Σ_l ‖W^l‖_F^2,    (6)

where the attention weight α_ij is kept, while the penalty on the radii R_j is removed.

Unified Form of mSVDD
We can write the two variants of mSVDD (i.e., soft-boundary mSVDD and simplified one-class mSVDD) in a unified form:

min_{R, W} (1/m) Σ_j β_j + C Σ_i [ Σ_j α_ij (‖φ(x_i; W) − c_j‖_2^2 − β_j) ]_+ + (λ/2) Σ_l ‖W^l‖_F^2,    (7)

where β_j ∈ {0, R_j^2}. β_j = 0 corresponds to simplified one-class mSVDD, and β_j = R_j^2 corresponds to soft-boundary mSVDD. For x_i and the j-th hypersphere, the attention weight α_ij should be inversely related to the distance d_ij = ‖φ(x_i; W) − c_j‖_2^2 to center c_j. Thus, we define:

α_ij = exp(d_ij / δ) / Σ_k exp(d_ik / δ),

where δ < 0 is a temperature parameter.
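The attention weight can be sketched as a temperature-scaled softmax over the distances. The minimal numpy version below (illustrative) also shows the limiting behavior used later in the paper: as the negative temperature approaches zero, the weights collapse to a hard argmin assignment.

```python
import numpy as np

def attention_weights(dists, delta=-0.9):
    """alpha_ij = exp(d_ij / delta) / sum_k exp(d_ik / delta).
    With delta < 0, closer hyperspheres (smaller d) get larger weight;
    as delta -> 0 from below, the weights approach a hard argmin."""
    z = dists / delta
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```

With `delta=-0.9` (the value used in the experiments) the nearest center dominates but the others still receive some weight; with `delta` very close to zero the assignment becomes effectively one-hot.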

Discussions on relationships between mSVDD and other models
Relationship with Uni-modal Deep SVDD
The relationship is obvious and can be summarized by the following proposition.
Proposition 2. Deep SVDD is a special case of the unified form of mSVDD with one hypersphere used.

Relationship between mSVDD and CVDD
CVDD (Ruff et al., 2019) is a one-class model for text data. In CVDD, each training sample x_i (i.e., a text) is represented by r self-attention feature vectors S_i = (s_i1, ..., s_ir) (Lin et al., 2017). CVDD uses a group of r context vectors C = (c_1, ..., c_r) to describe the target one-class data, where c_k ∈ R^p. CVDD tries to reduce the one-to-one reconstruction distance between the feature vectors S_i and the context vectors C. The loss can be defined as:

L_CVDD = (1/n) Σ_i Σ_k σ_ik d(c_k, s_ik),    (8)

where d(c_k, s_ik) computes the distance and σ_ik denotes the attention weight. The following proposition shows the close connection between the two models for one-class text learning.

Proposition 3. CVDD is a special case of one-class mSVDD when mSVDD is applied to text-based tasks under certain conditions.
Proof. W.l.o.g., rewrite the loss function of one-class mSVDD in a simplified form for each sample as follows:

Σ_j α_ij ‖φ_j(x_i; W) − c_j‖_2^2,

where we drop the regularization terms for the weights of φ and the radii, set m = r, let φ_j(x_i; W) = s_ij be the j-th feature vector of sample x_i, and let c_j be the j-th context vector of the target samples. Now the loss functions of CVDD and one-class mSVDD are almost the same.
Relationship between mSVDD and DMSVDD
DMSVDD (Ghafoori and Leckie, 2020) also uses multiple hyperspheres to extend SVDD. The loss function of DMSVDD is as follows:

min_{R, W} (1/K) Σ_k R_k^2 + (1/(νn)) Σ_i [‖φ(x_i; W) − c_{i*}‖_2^2 − R_{i*}^2]_+ + (λ/2) Σ_l ‖W^l‖_F^2,    (9)

where K is the number of hyperspheres, c_{i*} is the nearest center to sample x_i, and R_{i*} is its radius.
Proposition 4. DMSVDD can be seen as a hard version of soft-boundary mSVDD under a particular setting of the attention weight.
Proof. In the calculation of the attention weight α_ij, the temperature parameter δ influences the assignment to centers. If we set δ → 0⁻, the softmax acts as the argmin operation. In this case, α_ij = 1 for j = argmin_k d_ik, and α_ij = 0 otherwise. Now we can obtain the form of DMSVDD from soft-boundary mSVDD (Eq. 5) through this adjustment of the attention weight. Therefore, DMSVDD is also a special case of mSVDD.
The above relation illustrates the key difference between them: DMSVDD places all weight on the single hypersphere with the largest attention, i.e., the nearest one.

Summarizing mSVDD
We summarize the proposed mSVDD in accordance with the discussions presented above. The proposed multi-modal deep SVDD (mSVDD) learns a compact description of one-class data with multiple hyperspheres. mSVDD is also a generic framework that includes deep SVDD, CVDD, and DMSVDD if the corresponding conditions are met.

Multi-Modal Deep SVDD with Negative Supervision
In this section, we incorporate negative supervision into the training of mSVDD. SVDD-related models are usually trained with only positive samples from the target class; however, if negative samples (samples that should be rejected) are available, the models can be extended to train with them to improve the description (Tax, 2002). Note that these samples are not necessarily required to come from a "real" negative class. In our experiments, we use external data as pseudo-negative samples. We are given a set of extended training samples T = {(x_1, y_1), ..., (x_{n+a}, y_{n+a})}, where the first n samples are labeled y_i = 1, denoting positive, whereas the others are labeled y_i = 0, denoting negative samples that should be rejected by mSVDD. Our mSVDD is represented by m hyperspheres, formulated as M = {(c_1, R_1), ..., (c_m, R_m)}. The positive samples should lie inside the m hyperspheres, while the negative samples should lie outside. Given training samples composed of n positive and a negative samples, we first compute their distances to each center c_j. The goal of the optimization is to pull the positive samples closer to the centers and to push the negative ones away. Formally, we define the distance between a sample x_i and a center c_j as

d_ij = ‖φ(x_i; W) − c_j‖_2^2.

There are two common types of discriminative losses.
Contrastive type: The contrastive-type loss directly optimizes the distances by encouraging the distance between a positive sample and a center to be small, while forcing the distance to a negative sample to be large:

L_d^Con = Σ_i ( y_i d_ij + (1 − y_i) [R_j^2 − d_ij]_+ ),    (10)

where R_j^2 can be seen as a margin (or threshold) that prevents too much effort from being wasted on enlarging/reducing distances (Hadsell et al., 2006).
Triplet type: The triplet-type loss is defined for a pair of a positive sample x_i and a negative sample x_{i'}. If we consider center c_j as an anchor representative of the target data, the triplet loss punishes a pair unless d_ij, the distance from x_i to c_j, is smaller than d_{i'}j, the distance from x_{i'} to c_j, by at least a margin τ > 0:

L_d^Tri = Σ_{(i, i')} [d_ij − d_{i'}j + τ]_+.    (11)

For clarity, Eqs. 10 and 11 show the two types of losses for only one hypersphere. The multi-modal version is obtained by summing over j ∈ {1, ..., m}.
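For a single hypersphere, the two distance-based losses can be sketched as follows. This is a numpy illustration that assumes the squared distances to the center have already been computed; the function names are hypothetical.

```python
import numpy as np

def contrastive_loss(d_pos, d_neg, R2):
    """Pull positives toward the center, push negatives outside radius R^2:
    sum_i d_pos_i + sum_i' [R^2 - d_neg_i']_+  (R^2 acts as the margin)."""
    return d_pos.sum() + np.maximum(R2 - d_neg, 0.0).sum()

def triplet_loss(d_pos, d_neg, tau=0.1):
    """Penalize a (positive, negative) pair unless the positive is closer
    to the center than the negative by at least margin tau."""
    return np.maximum(d_pos[:, None] - d_neg[None, :] + tau, 0.0).sum()
```

Note the asymmetry discussed next in the text: the triplet loss is zero as soon as positives merely outrank negatives, and the contrastive loss is zero once negatives clear the radius, so both are easy to satisfy against pseudo-negatives.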
The triplet loss only forces positive samples to be closer to the center than negative samples, and the contrastive loss only requires keeping the distances of negative samples above the radius. These two objectives are easy to satisfy, especially when we assume that the negative samples are "not real." This can result in a failure to make full use of the negative supervision. Therefore, we reformulate both L_d^Tri and L_d^Con.

Reformulating Contrastive and Triplet Losses
Normalization layer In neural models with contrastive or triplet losses, it is a common strategy to normalize the feature representations of samples for training stability (Schroff et al., 2015; Wang et al., 2017). Therefore, we apply normalization to the feature vectors: x̂ = x / (‖x‖_2 + ε), where ε > 0 is a small value avoiding division by zero.
Reformulation Given a center c_j and positive and negative samples, we can use a probability form in the optimization objective rather than the two non-probabilistic losses L_d^Con and L_d^Tri. We introduce p(y_i = 1 | x_i, c_j), the probability that the hypersphere with center c_j accepts sample x_i, and define it as follows:

p(y_i = 1 | x_i, c_j) = σ(s · x̂_i^T ĉ_j),    (12)

where x̂_i denotes the (normalized) feature output vector of x_i, and s is a scale hyper-parameter for preventing failed convergence (Wu et al., 2018) after normalization. For each sample x_i, c_j acts as a pseudo-weight vector for the classification of the j-th hypersphere of mSVDD. Thus, given p(y_i = 1 | x_i, c_j), the probability of a sample being accepted by hypersphere M_j, we can reformulate the two discriminative losses with this probability.
Contrastive type loss: This loss maximizes the likelihood of the training positive samples being accepted and the negative ones being rejected:

L^Con = −Σ_i ( y_i log p(y_i = 1 | x_i, c_j) + (1 − y_i) log(1 − p(y_i = 1 | x_i, c_j)) ).    (13)
Triplet type loss: This loss punishes when the log probability of a negative sample is greater than that of a positive sample within a margin τ:

L^Tri = Σ_{(i, i')} [ log p(y_{i'} = 1 | x_{i'}, c_j) − log p(y_i = 1 | x_i, c_j) + τ ]_+.    (14)

Eqs. (13) and (14) show the uni-modal case; for the multi-modal one, we have to consider m different centers {c_1, ..., c_m} in the calculation of the two reformulated discriminative losses. Therefore, we propose the following two strategies:

Max: p(y_i = 1 | x_i) = max_j p(y_i = 1 | x_i, c_j); Mean: p(y_i = 1 | x_i) = (1/m) Σ_j p(y_i = 1 | x_i, c_j),    (15)

where Max references only the hypersphere M_j with the maximum logit output, while Mean takes all hyperspheres into account equally. We then obtain the corresponding contrastive and triplet losses by substituting the probability term in Eqs. (13) and (14) with Eq. (15).
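The reformulated acceptance probability and the two aggregation strategies can be sketched as follows. The helper is illustrative: it normalizes the feature and the centers, scores with a scaled sigmoid, and aggregates over hyperspheres by Max or Mean.

```python
import numpy as np

def accept_prob(feat, centers, s=1.2, eps=1e-6, mode="max"):
    """p(y=1 | x, c_j) = sigmoid(s * <normalized feat, normalized c_j>).
    mode='max' keeps only the hypersphere with the largest logit;
    mode='mean' averages the probabilities over all hyperspheres."""
    f = feat / (np.linalg.norm(feat) + eps)
    C = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + eps)
    p = 1.0 / (1.0 + np.exp(-s * (C @ f)))
    return p.max() if mode == "max" else p.mean()
```

Because the features are unit-normalized, the logit is bounded by s, which is why a scale hyper-parameter larger than 1 is needed for the sigmoid to produce confident probabilities.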

Training Loss
The final training loss for mSVDD with negative supervision can be formulated as:

L = L_mSVDD + γ L_{Con|Tri},    (16)

where γ adjusts between the mSVDD loss and the discrimination loss with negative supervision. In the training process, L_{Con|Tri} sums the loss over one batch of samples with Eq. (13) or (14). Algorithm 1 provides the training process for mSVDD with negative supervision in one epoch. Please see Section B in the appendix for more discussion of the relationship between mSVDD and the use of negative supervision.
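The per-batch joint objective can be sketched as follows: a toy numpy version combining the one-class mSVDD term with a distance-based triplet term between positives and pseudo-negatives. No gradient step is shown, and the features are assumed to have already been produced by the encoder; the function is illustrative, not the paper's Algorithm 1.

```python
import numpy as np

def batch_loss(feats, labels, centers, gamma=1.0, delta=-0.9, tau=0.1):
    """One-batch joint loss in the spirit of Eq. (16): the one-class mSVDD
    term on the positives plus gamma times a triplet term between
    positives and pseudo-negatives (toy numpy version)."""
    d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (n, m)
    z = d / delta
    z = z - z.max(axis=1, keepdims=True)                           # stability
    alpha = np.exp(z)
    alpha /= alpha.sum(axis=1, keepdims=True)                      # attention
    pos, neg = labels == 1, labels == 0
    l_msvdd = (alpha[pos] * d[pos]).sum(axis=1).mean()             # intra-class
    l_tri = sum(                                                   # inter-class
        np.maximum(d[pos, j][:, None] - d[neg, j][None, :] + tau, 0.0).mean()
        for j in range(centers.shape[0]))
    return l_msvdd + gamma * l_tri
```

Setting `gamma=0` recovers plain one-class mSVDD training, which is the ablation discussed in Section B.2 of the appendix.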

Datasets and Implementation Details
Datasets Experiments were conducted on two datasets: 20 Newsgroups and Reuters, which have been commonly used in other one-class text classification work (Manevitz and Yousef, 2001; Ruff et al., 2019). We used the same pre-processing steps as in earlier work (Ruff et al., 2019), including lowercasing, stopword removal, and tokenization. We used external data for negative supervision in the absence of "real" labeled negative instances. We followed logic similar to that used for pretrained word vectors, where one general corpus, such as Wikipedia articles, is often adopted as the training dataset (Mikolov et al., 2013). We therefore chose one publicly available corpus, WikiText-2 (Merity et al., 2016), extracted from Wikipedia articles, as our external data. As shown in Algorithm 1, the data loader loads one batch of negative samples, i.e., sentences from WikiText-2, which are labeled 0.
Encoder For encoding the text input, i.e., φ(x; W), we used a bidirectional LSTM with attention (Hochreiter and Schmidhuber, 1997; Xu et al., 2015) with 150 hidden units. For the pre-trained word embeddings, we experimented with GloVe vectors (Pennington et al., 2014) with dimension 300. We did not adopt the widely used BERT model (Devlin et al., 2019), as Ruff et al. (2019) showed that BERT did not improve performance.
Settings For the optimization of parameters, Adam (Kingma and Ba, 2014) with a base learning rate of 0.001 was used for 50 epochs. The batch sizes were set to 32 and 64 for Reuters and 20 Newsgroups, respectively. For the initialization of the mSVDD model, we employed two steps.
In the absence of negative samples, mSVDD was first pre-trained on target samples using an AutoEncoder with two objectives: 1) warming up, and 2) reducing the reconstruction error on the target samples, such that the model becomes more robust to noisy or anomalous inputs (Jacobs, 1995; Hinton and Salakhutdinov, 2006). An AutoEncoder feed-forward network with a 0.5 compression rate, consisting of an encoder and a decoder, was appended to the BiLSTM feature network. Then, the centers of the m hyperspheres in mSVDD were initialized by running k-means clustering on the learned features (Lloyd, 1982). As for the regularization of mSVDD, c_j was regularized (Ng, 2004), and a weight decay of 0.95 was applied to the parameters. For the number of hyperspheres, different settings (1, 3, 5, 10) were tested. For the hyperparameters, we set the scale s = 1.2, ν = 0.1, δ = −0.9 for the attention weight, τ = 0.1 for the triplet loss, ε = 1e−6 for the normalization, and γ = 1 for the training loss. The results were averaged over 10 runs with different random seeds.

Evaluation metrics
The performance was measured by the area under the receiver operating characteristic (ROC) curve (AUC), a commonly used metric for one-class text classification (Manevitz and Yousef, 2001; Ruff et al., 2019). Table 1 shows the performance of mSVDD with different choices of m, the number of hyperspheres. Here, mSVDD(1) represents uni-modal deep SVDD (Ruff et al., 2018). The results show that: 1) For the one-class version, mSVDD provides better performance than the uni-modal one, especially when more hyperspheres are used. mSVDD(10), which uses the largest m, outperforms mSVDD(1) more often than mSVDD(5) and mSVDD(3), the latter performing comparably with the uni-modal model. Similar results can be observed for the soft-boundary form: mSVDD with more hyperspheres (10 or 5) won more often than the other settings, in nine out of thirteen cases across the two datasets. This demonstrates the benefit of incorporating more hyperspheres to better describe the target data.
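The AUC metric used throughout has a direct pairwise-ranking interpretation, which the following helper computes. This is a numpy sketch for clarity only; in practice samples are scored by, e.g., the negative distance to the nearest hypersphere.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """AUC = probability that a random target sample receives a higher
    score than a random negative sample (ties count one half)."""
    wins = (scores_pos[:, None] > scores_neg[None, :]).sum()
    ties = (scores_pos[:, None] == scores_neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))
```

This rank-based view explains why AUC is threshold-free, which matters for one-class models whose decision threshold is otherwise arbitrary.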

Results of mSVDD
2) The performance of mSVDD did not improve linearly with m, which we can explain from the following aspects. As for the model, mSVDD with more centers has more parameters and a more complex structure, which is harder to optimize, especially on data with a small training size (e.g., pol or rel). As for the data, some data might have simple distributions without the need for more modes. Another aspect is the attention weighting over multiple hyperspheres: Ghafoori and Leckie (2020) showed that focusing on a few "good" hyperspheres can be more beneficial than spreading attention over all hyperspheres. In the calculation of attention, we did not adjust δ so as to place a large weight on one specific hypersphere, which may limit the improvements. We compare mSVDD with DMSVDD below. Table 2 shows the performance of mSVDD trained with negative supervision and compares the results with those of the other methods. Based on the discussion in the last subsection, we used m = 3 in this subsection. To perform negative supervision for mSVDD, we evaluated four approaches that select different losses and their reformulated probability forms. For DMSVDD, we report the results with the initial number of spheres K_init = 10. Comparing DMSVDD and mSVDD, DMSVDD focuses on one hypersphere and performs slightly better than mSVDD(3) in some cases (e.g., earn, acq, and comp). This suggests that discarding "bad" hyperspheres is sometimes necessary. For Reuters, the results indicate that mSVDD benefits from joint training with the discrimination losses, except for acq and ship. mSVDD with negative supervision also achieved the best scores in four cases compared with the other methods, including DMSVDD.

Results of mSVDD with negative supervision
We observe even clearer differences on 20 Newsgroups. All four negative supervision methods markedly improved mSVDD and performed best among all baselines for all target classes of 20 Newsgroups. For example, mSVDD with negative supervision increased the AUC by 2-3 points for comp. Comparing the losses for negative supervision, the contrastive-type loss, which imposes a larger punishment on negative data, performs better than the triplet-type loss, which uses a relatively small margin. Much more distinct improvements can be seen in the comparison with CVDD for rec and with OC-SVM for misc, whose best scores were taken from Ruff et al. (2019). Further, the contrastive loss consistently outperformed the other models, including the baselines. In addition, the Con+Max strategy outperformed Con+Mean in the probability reformulation. We hypothesize that focusing on one of the hyperspheres is effective when mSVDD is used with the contrastive loss. Table 3 shows the results of CVDD with the negative supervision proposed for mSVDD. As mentioned in Section 3.3, CVDD can be seen as a special case of mSVDD; therefore, the proposed negative supervision approaches for mSVDD can, in theory, also be applied to CVDD. To highlight the usefulness of the negative supervision, we conducted experiments using the triplet loss with the Max probability for CVDD. As for the implementation, since CVDD uses a different multi-head structure, we also used a different form to incorporate Triplet+Max into CVDD (see Section C in the appendix for details). Overall, the proposed negative supervision enhances CVDD in most cases on the two datasets. The overall performance mainly shows the following: 1) The improvement from negative supervision on CVDD is consistent with that on mSVDD, due to the similarity between the two.
2) The generality of the negative supervision can be shown, as Triple+Max was successfully applied to the different multi-head structure.

Results of CVDD with negative supervision
Regarding the different target classes, ship, with its smaller training data size, may cause worse performance, as does rel with CVDD(3); these are phenomena similar to what we observed with mSVDD. In addition, the negative supervision can also prevent over-fitting for CVDD. For example, CVDD(3), with the fewest parameters, achieved the best score for comp when varying r among 3, 5, and 10. In contrast, when the negative supervision was used, CVDD(10), with the most parameters, attained the best score and also performed better for all six target classes of the 20 Newsgroups dataset.

[Table 2: mSVDD with negative supervision. AUCs in % on the Reuters (left) and 20 Newsgroups (right) datasets. For the two baselines, OC-SVM and CVDD, we adopted their best scores from Ruff et al. (2019). DSVDD and DMSVDD are our implementations. One and Soft denote the one-class and soft-boundary forms, respectively. +Triplet+Max denotes mSVDD with the triplet loss and the Max probability strategy, followed by the three other negative supervision methods. In the rows of mSVDD Soft and mSVDD One, a '+' following a number means that there were improvements with negative supervision (in three of the four methods). The best scores in each column are in bold, and the second best are underlined.]

Conclusion
In this work, we proposed mSVDD, a new generic one-class text classification framework based on multi-modal deep SVDD. Unlike uni-modal deep SVDD, mSVDD enhances the description ability for the target one-class data with multiple hyperspheres. We also proved that this generic framework includes three variants, deep SVDD, DMSVDD, and CVDD, under certain conditions. In addition, in the absence of "real" negative training data, we proposed approaches for effectively adding negative supervision to further improve the performance of mSVDD. The experiments validated that the proposed mSVDD provides better performance than uni-modal SVDD.
The experiments also showed further improvements in most cases when negative supervision was used for mSVDD and CVDD. In future work, we will investigate sampling strategies to improve the current approach.
A Proof of Proposition 1
Schölkopf et al. (2001) proved that, in single-class classification, ν is the upper bound of the fraction of anomalies and the lower bound of the fraction of training samples being anomalies or lying on the optimal boundary. Ruff et al. (2018) proved that this ν-property still holds for uni-modal soft-boundary deep SVDD. Although the same proposition does not hold for our multi-modal case in general, it is still true when the attention weight α_ij is constant across hyperspheres. This gives us an intuition about the role of ν.

Proposition 1. (ν-property) The hyperparameter ν ∈ (0, 1] in soft-boundary mSVDD satisfies the following if we set an equal attention weight for each hypersphere: i. ν is an upper bound on the fraction of outlier samples.
ii. ν is a lower bound on the fraction of training samples being rejected or on the optimal boundary.
Proof. Ad (i). For each training instance x_i, its loss is the hinge loss [d_i − R_s]_+, where, with equal attention weights α_ij = 1/m, we define

d_i = (1/m) Σ_j ‖φ(x_i; W) − c_j‖_2^2 and R_s = (1/m) Σ_j R_j^2.

W.l.o.g., we assume d_1 ≤ ... ≤ d_n, so that d_n is the largest averaged distance. The number of outliers is n_out = |{i | d_i > R_s}|. Rewriting the objective of soft-boundary mSVDD (Eq. 5), ignoring the weight regularizer, gives:

R_s + (1/(νn)) Σ_i [d_i − R_s]_+ = (1 − n_out/(νn)) R_s + (1/(νn)) Σ_{i: d_i > R_s} d_i.

Since the objective is minimized over R_s, the coefficient 1 − n_out/(νn) must be non-negative (otherwise increasing R_s would lower the objective). Thus, n_out ≤ νn must hold during training, which implies that at most νn samples are rejected as outliers.
Ad (ii). The optimal R*_s must satisfy n_out ≤ νn. If R*_s ≥ d_n, then n_out takes its minimum value 0, which means the boundary includes all the samples. Since n_out increases as R_s decreases, if n_out takes its maximum value under condition (i), we obtain the minimal R*_s = d_{i*}, where i* = n − n_out, i.e., d_{i*} is the (n_out + 1)-th largest distance. Define {x_i | d_i ≥ R*_s} as the set of training samples being rejected (d_i > R*_s) or lying on the optimal boundary (d_i = R*_s). Then we have the inequality:

|{x_i | d_i ≥ R*_s}| = |{x_i | d_i > R*_s}| + |{x_i | d_i = R*_s}| ≥ n_out + 1 ≥ νn.

This implies that at least νn samples are rejected or lie exactly on the optimal boundary. Figure 2 shows an example with 10 training samples.
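The ν-property proved above can be checked numerically: minimize R_s + (1/(νn)) Σ_i [d_i − R_s]_+ over candidate radii and verify the two bounds. The helper below is an illustrative numpy sketch (it exploits the fact that the minimizer lies at 0 or at one of the distances).

```python
import numpy as np

def optimal_radius(d, nu):
    """Minimize R + (1/(nu*n)) * sum_i [d_i - R]_+ over candidate radii;
    the piecewise-linear objective attains its minimum at 0 or some d_i."""
    n = len(d)
    cands = np.concatenate([[0.0], np.sort(d)])
    objs = [R + np.maximum(d - R, 0.0).sum() / (nu * n) for R in cands]
    return cands[int(np.argmin(objs))]
```

With ν = 0.2 and ten distances, the optimal radius leaves at most two points strictly outside and at least two points outside or on the boundary, matching bounds (i) and (ii).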

B Discussions on mSVDD with Negative Supervision
B.1 Relationship between mSVDD and the use of negative supervision
mSVDD and negative supervision are not two independent sub-architectures. The negative supervision losses, contrastive and triplet, are tailored to mSVDD. Specifically, the two components are closely connected through the hypersphere centers c_j: both mSVDD (Eq. 7) and negative supervision (Eq. 13 or 14) contain c_j. Since there is no real negative data, external data are used as pseudo-negative samples to provide negative supervision. The use of negative supervision can improve the discrimination ability of mSVDD. In training, the negative supervision loss forces mSVDD to reject unseen samples, since the real negative data encountered at test time are also unseen during training. This improves inter-class discrepancy, complementing the intra-class loss that mSVDD optimizes. In testing, however, the decision function remains the same as for mSVDD trained with only positive samples.

B.2 Necessity of joint loss
In the training loss of mSVDD with negative supervision (Eq. 16), L_mSVDD aims to minimize the intra-class variations, while L_{Con|Tri} tries to maximize the inter-class discrimination. If γ in Eq. 16 is set to 0, mSVDD is trained only with target positive samples, and no discriminative information can be learned. On the other hand, if we use only the L_{Con|Tri} loss for training, it may result in large intra-target variations, especially when the triplet-type loss is chosen, since it only requires positive samples to be closer to the centers than pseudo-negative samples. Additionally, because real negative samples are absent, sampling "appropriate" pseudo-negative samples is itself a problem if the contrastive or triplet losses are to fit our original objective, i.e., learning a compact description boundary for the target one-class data. Therefore, it is necessary to train jointly with the negative supervision loss.

C Implementation of CVDD with negative supervision
The proposed negative supervision methods can also be applied to CVDD. Here we introduce our implementation of CVDD with the triplet-type loss and the Max probability strategy. CVDD uses a group of r context vectors C = (c_1, ..., c_r) to describe the target one-class data, where c_k ∈ R^p. Given a context vector c_k, ∀k ∈ {1, ..., r}, and a pair of positive and negative training samples, we can derive the reformulated probability form. First, CVDD maps a training sample x_i to r heads of feature vectors S_i = (s_i1, ..., s_ir). Then, we denote by p(y_i = 1 | s_ik, c_k) the probability that the k-th feature vector s_ik reconstructs the k-th context vector c_k well:
p(y_i = 1 | s_ik, c_k) = σ(ŝ_ik^T ĉ_k).

With the triplet loss and the Max probability strategy, we define the negative supervision loss as:

L^Tri = Σ_{(i, i')} [ max_k log p(y_{i'} = 1 | s_{i'k}, c_k) − max_k log p(y_i = 1 | s_ik, c_k) + τ ]_+,

where τ is a margin and x_{i'} is a pseudo-negative sample. L^Tri can then be added to Eq. 8 to obtain the training loss.