Re-embedding Difficult Samples via Mutual Information Constrained Semantically Oversampling for Imbalanced Text Classification

Difficult samples of the minority class in imbalanced text classification are usually hard to classify because they are embedded in an overlapping semantic region with the majority class. In this paper, we propose a Mutual Information constrained Semantically Oversampling framework (MISO) that can generate anchor instances to help the backbone network determine the re-embedding position of a non-overlapping representation for each difficult sample. MISO consists of (1) a semantic fusion module that learns entangled semantics among difficult and majority samples with an adaptive multi-head attention mechanism, (2) a mutual information loss that forces our model to learn new representations of entangled semantics in the non-overlapping region of the minority class, and (3) a coupled adversarial encoder-decoder that fine-tunes disentangled semantic representations so that they retain their correlations with the minority class, and then uses these disentangled semantic representations to generate anchor instances for each difficult sample. Experiments on a variety of imbalanced text classification tasks demonstrate that anchor instances help classifiers achieve significant improvements over strong baselines.


Introduction
Data imbalance is a long-standing challenge in text classification tasks such as sentiment analysis (Wu et al., 2018), intent detection (Quan et al., 2020) and spam detection (Liu et al., 2017), where the distribution of training data over classes is skewed. For example, minority samples account for only 28% of training instances in the SMS Spam dataset (Peng et al., 2019) and 14% in the Opin-Rank dataset (Ganesan and Zhai, 2012). The imbalance issue is even more severe in the Toutiao dataset (Ouyang et al., 2020), with a minority-majority ratio of 1:122 (hereafter imbalance ratio). As the class distribution becomes extremely imbalanced, texts from the minority class(es) may be easily categorized into the majority class(es) (He and Garcia, 2009; Fernández et al., 2018; Gao et al., 2020; Yang et al., 2020).

(* Corresponding author: xiaowangzhang@tju.edu.cn)

Recent studies have shown that some minority samples, called difficult samples because they are located in the overlapping semantic region, are more important for imbalanced text classification than those far from the overlapping semantic region (Girshick et al., 2014; Robinson et al., 2020). As illustrated in Figure 1, difficult samples have embeddings similar to (entangled with) those of some majority samples in this overlapping semantic region, as they resemble these majority samples in surface forms (e.g., n-grams or syntax). For example, in the Yelp.P dataset, the review "my parents didn't want to go back to beautiful Miami" is a difficult sample of the minority class. However, many words of this sample also occur in a positive review of the majority class: "the beauty of Miami made Jessie reluctant to go back". Table 1 shows the percentages of difficult samples in several imbalanced datasets and the impact of difficult samples on the classification performance of the strongest baseline (XLNet). Clearly, classification errors mainly come from the misclassification of difficult samples.
The most serious situation appears in the AG_News (1%) dataset, where difficult samples account for 48.5% of minority samples, and XLNet obtained an F1-score of only 17.5% on these difficult samples.
The latest research on imbalanced learning separates the learning procedure into representation learning (i.e., the backbone network) and classification (i.e., the classifier), and achieves state-of-the-art performance by freezing the backbone network and fine-tuning the classifier weights to obtain balanced decision boundaries (Kang et al., 2020). However, this line of work ignores that the entangled semantic representations of difficult samples make the decision boundaries hard to determine clearly.
To this end, we propose to generate anchor instances, which share surface forms with difficult samples but are embedded in the non-overlapping region of the minority class, to help the backbone network learn disentangled semantic representations for difficult samples. As shown in Figure 1, for the aforementioned difficult sample, two anchor-instance-generation steps are taken. First, the entangled semantics of "beauty", "Miami" and "go back" in the difficult sample are decoupled from the anchor instance. Second, the semantics of "to my disappointment" and "help me" from some non-difficult samples of the minority class are injected into the anchor instance.
In order to make this generation framework feasible, we should answer the following three questions: (1) Given a difficult sample paired with a majority sample, how can we capture their entangled semantics? (2) How can we decouple and inject semantics from and into an anchor instance? (3) Merging anchor instances into the original data may change the data distribution and hence have a negative impact on non-difficult sample classification. How can we avoid this?
To address these problems, we propose a Mutual Information constrained Semantically Oversampling (MISO) approach with three essential components: a semantic fusion module (SFM), a mutual information (MI) loss, and a coupled adversarial generator (CAG) based on encoder-decoder networks. SFM is leveraged to adaptively find entangled semantics among difficult samples and majority samples in the overlapping region. Formally, treating the majority and minority classes as two random variables $A$ and $B$, their entangled semantics can be modeled as the mutual information between the two classes, $I(A; B)$. We introduce the MI loss, which is symmetric and smooth, for decoupling and injecting in parallel. The semantic representations output by SFM, constrained by the MI loss, are fed into CAG to generate anchor instances, with an adversarial strategy ensuring that the original data distribution is not destroyed.
In addition to the proposed MISO framework for imbalanced text classification, the other contributions of our work can be summarized as follows. First, the boundary of the minority class learned by the MI loss is theoretically proved to be a (near-)optimal boundary. Second, we further show theoretically that the new distribution after adding anchor instances is consistent with the original distribution of the minority class (see Proof 1 and Proof 2 in the Appendix). Third, experimental results demonstrate that text classifiers trained on datasets rebalanced with anchor instances generated by MISO outperform state-of-the-art methods by an average of 2.7% on nine datasets. The average success rate of moving difficult samples into the non-overlapping region is 13.7%, which validates the effectiveness and robustness of MISO in handling difficult samples.

Related Work
Imbalanced Learning in NLP The re-sampling approach to this issue restores the balance of the class distribution by either undersampling the majority class or oversampling the minority class (Han et al., 2005; Chawla et al., 2002). Cost-sensitive methods estimate the cost of samples with a cost matrix and train the classifier with different penalties (Gomez et al., 2000; McBride et al., 2019). Additionally, text style transfer with generative adversarial networks (GANs) has also been used for oversampling (Fu et al., 2018; Guo et al., 2018; Nie et al., 2019). One advantage of these methods is that the generated texts still follow the original data distribution. Kang et al. (2020) propose a long-tailed learning approach (τ-norm and cRT) to separate representation learning and classifier training. Chen et al. (2020) introduce MixText with TMix, a data augmentation method similar to the Mixup technique used in computer vision, to interpolate new points in the corresponding hidden space. Lin et al. (2017) propose a soft sampling method that dynamically adjusts the weights of difficult samples by redefining the loss function. A Dice loss that optimizes the Sørensen-Dice coefficient to be immune to the imbalance issue has also been proposed. Glazkova (2020) introduces ADASYN to assign a weight to each minority instance.

Difficult Sample Modeling in NLP
Difficult Sample Modeling in CV Difficult sample learning is one of the fundamental issues in object detection (Oksuz et al., 2019). Inspired by the view that difficult samples usually have a high loss, several studies adopt a bootstrapping strategy to mine difficult samples (Felzenszwalb et al., 2009; Ren et al., 2015). GANs have also been used to generate difficult samples. Pang et al. (2019) propose a method based on Intersection-over-Union to sample negative examples.
Our Proposal Significantly different from previous methods, our proposed MISO explores mutual information to decouple the overlap between the majority and minority classes, which theoretically guarantees the consistency of the class distribution after oversampling.

Problem Statement
Let $X^+ := \{x^+_1, \dots, x^+_{n^+}\} \in \mathbb{R}^{n^+ \times l}$ be a training set of positive samples with the minority class distribution $N^+$, and $X^- := \{x^-_1, \dots, x^-_{n^-}\} \in \mathbb{R}^{n^- \times l}$ be a training set of negative samples with the majority class distribution $N^-$, where $x_i$ is the $i$-th sentence consisting of up to $l$ tokens, and $n^-$ and $n^+$ are the numbers of instances in the majority and minority classes, respectively. Data imbalance can be roughly divided into slight imbalance (e.g., $\frac{n^+}{n^-} = \frac{4}{6}$) and severe imbalance (e.g., $\frac{n^+}{n^-} = \frac{1}{100}$ or less) (He and Garcia, 2009; Brownlee, 2019). MISO learns a joint distribution $Z$ for the majority and minority classes in the same semantic space. From this distribution, we sample $Z := \{z_1, \dots, z_m\} \in \mathbb{R}^{m \times d}$, which consists of $m \in [0, n^+ \times n^-]$ $d$-dimensional vectors.
The goal of MISO is to make $Z$ close to $N^+$ but far from $N^-$. In doing so, we generate a set of anchor instances $Y^+ := \{y^+_1, \dots, y^+_t\} \in \mathbb{R}^{t \times l}$ with $Z$ as their disentangled representations for the difficult samples in $X^+$, where $t$ is the number of anchor instances. We further define a marginal distribution over $Y^+$ as $U_{\psi^+, \sigma, \omega}$, where $\psi^+$, $\sigma$, and $\omega$ are the parameters of a continuous and differentiable parametric function $E_{\psi^+}$ (i.e., the minority encoder-decoder), SFM, and the MI loss, respectively.

MISO
We introduce the overall architecture and then elaborate on each component of MISO in this section.

The Overall Architecture
As shown in Figure 2, MISO is built upon a coupled adversarial encoder-decoder framework that consists of two encoders together with two decoders (i.e., a latent variable-guided decoder and a standard one) and two discriminators. The two encoders are used to encode instances from the minority class (left encoder) and instances from the majority class (right encoder). To project instances from the two classes into the same semantic space, the two encoders share their parameters. MISO is equipped with two additional components: SFM and MI loss.
SFM captures the entangled semantics of difficult samples by extracting the semantics of the minority class that are similar to those of the majority class (Step 1). The learned entangled semantic representations are fused by a feedforward layer (Step 2) and then fed into two Mutual Information Neural Estimators (MINEs) (Belghazi et al., 2018) (Step 3). The MI loss uses these MINEs to decouple entangled semantic representations from the majority class (Step 4) and then feeds the disentangled semantic representations into another feedforward layer (Step 5). Specifically, the MI loss minimizes the mutual information between entangled semantic representations and the minority class at the decoupling step, and maximizes the mutual information between disentangled semantic representations and the majority class at the injecting step. In doing so, we move entangled semantic representations from the overlapping region into the non-overlapping region of the minority class, where they are disentangled from the majority class.
The disentangled semantic representations are then fed into the minority class decoder (left decoder) to generate anchor instances, which are no longer hard to classify. The right decoder is used to generate instances of the majority class. Both decoders are monitored by two discriminators that adversarially detect whether the newly generated texts have the same surface forms as the original inputs.

Model Components
SFM In this module, we use a multi-head attention mechanism to learn the entangled semantic parts of the input difficult samples.
Each attention head obtains initial semantic representations $Q$ and $K$ by computing $E_{\psi^+_e}(x^+)$ and $E_{\psi^-_e}(x^-)$, where $E_{\psi^+_e}$ and $E_{\psi^-_e}$ are the two encoders with parameters $\psi^+_e$ and $\psi^-_e$. Once we have $Q$ and $K$, we obtain the entangled semantic representations $\tilde{Q}$ as follows:

$$\tilde{Q} = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\tau\sqrt{d}}\right)K,$$

where $\tau = \frac{\#\text{majority samples}}{\#\text{minority samples}}$ in the current epoch, so that $\tau \in [1, +\infty)$ is an adaptive temperature parameter controlling the scope of entangled semantics that currently needs to be extracted. In other words, in the initial epochs, the entangled semantics of difficult samples are hard to capture, so each difficult sample needs to be compared with (near-)all majority samples. In the final epochs, the ability of SFM to extract entangled semantics is significantly enhanced, so each difficult sample only needs to be compared with the subset of majority samples that have the strongest semantic similarity to it. Finally, SFM obtains $b_s \cdot h$ triples $(Q, \tilde{Q}, K)$ after a feed-forward network, where $b_s$ is the mini-batch size and $h$ is the number of attention heads. We denote the distributions of $Q$, $\tilde{Q}$, and $K$ as $\mathcal{Q}$, $\tilde{\mathcal{Q}}$, and $\mathcal{K}$, respectively, and SFM as $S_\sigma$ with its parameters $\sigma$.
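As a concrete illustration, the temperature-scaled cross-attention that produces $\tilde{Q}$ can be sketched in a few lines of NumPy. This is a minimal single-head sketch under our reading of the text (the function names and shapes are illustrative, not the authors' implementation); the key point is that a large $\tau$ flattens the attention over majority samples, while $\tau \to 1$ concentrates it on the most similar ones.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entangled_head(Q, K, tau):
    """One SFM-style attention head (sketch).

    Q: (n_min, d) minority encodings, K: (n_maj, d) majority encodings.
    tau >= 1 is the adaptive temperature: a large tau (early epochs)
    spreads attention over (near-)all majority samples; tau -> 1
    focuses it on the most similar ones.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / (np.sqrt(d) * tau)   # temperature-scaled similarity
    attn = softmax(scores, axis=-1)         # each row sums to 1
    return attn @ K                         # entangled representations Q~

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(10, 8))
Q_tilde = entangled_head(Q, K, tau=3.0)
assert Q_tilde.shape == (4, 8)
```

With a very large temperature the attention degenerates to a uniform average over all majority encodings, matching the intuition that early in training each difficult sample is compared with the (near-)entire majority class.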

MI Loss
We propose to use mutual information to measure semantic similarity because the loss value computed from mutual information yields a (near-)optimal boundary of the minority class; the theoretical proof is given in the Appendix. We first estimate the mutual information with two MINEs (Belghazi et al., 2018), $T_{\omega^+}$ with parameters $\omega^+$ and $T_{\omega^-}$ with parameters $\omega^-$. $T_{\omega^+}$ is an integrable function used to estimate the KL-divergence between the joint distribution $\mathcal{Q}\tilde{\mathcal{Q}}$ and the product of the marginals $\mathcal{Q} \otimes \tilde{\mathcal{Q}}$. $T_{\omega^-}$ is used to estimate the KL-divergence between the joint distribution $\mathcal{K}\tilde{\mathcal{Q}}$ and the product of the marginals $\mathcal{K} \otimes \tilde{\mathcal{Q}}$. Since the KL-divergence is lower-bounded via its Donsker-Varadhan (DV) representation (Donsker and Varadhan, 1975), the two MINEs are formulated as follows:

$$I(Q; \tilde{Q}) \geq \mathbb{E}_{\mathcal{Q}\tilde{\mathcal{Q}}}[T_{\omega^+}] - \log \mathbb{E}_{\mathcal{Q} \otimes \tilde{\mathcal{Q}}}\big[e^{T_{\omega^+}}\big], \qquad I(K; \tilde{Q}) \geq \mathbb{E}_{\mathcal{K}\tilde{\mathcal{Q}}}[T_{\omega^-}] - \log \mathbb{E}_{\mathcal{K} \otimes \tilde{\mathcal{Q}}}\big[e^{T_{\omega^-}}\big].$$

We use the MI loss to locally optimize SFM by minimizing $I(K; \tilde{Q})$ (i.e., decoupling entangled semantic representations from the overlapping semantic region) and maximizing $I(Q; \tilde{Q})$ (i.e., moving disentangled semantic representations into the non-overlapping semantic region, away from the majority class). Therefore, the MI loss is defined as follows:

$$\mathcal{L}_{MI} = I(K; \tilde{Q}) - I(Q; \tilde{Q}).$$

CAG The coupled adversarial generator generates new texts from the decoupled semantic representations and the original majority samples. The goal of CAG is to obtain anchor instances with surface forms similar to the input difficult samples, without destroying the original data distribution.
To this end, we make $U_{\psi^-}$ and $U_{\psi^+, \sigma, \omega}$ match the prior distributions $N^-$ and $N^+$ by introducing two discriminators, $D_{\phi^-}$ and $D_{\phi^+}$, each composed of a single hidden layer ($\phi^-$ and $\phi^+$ denote the parameters of the majority and minority discriminators). The reconstruction losses of the two decoders, $De_{\psi^+_d}$ with parameters $\psi^+_d$ and $De_{\psi^-_d}$ with parameters $\psi^-_d$, are defined over the outputs $Y^-$ of the majority decoder from the majority input samples $X^-$ (and analogously for the minority side), where $E_{\psi^-}$ denotes the encoder-decoder for the majority class with parameters $\psi^- \supseteq \{\psi^-_e \cup \psi^-_d\}$. Similarly, the minority encoder-decoder is denoted as $E_{\psi^+}$ with parameters $\psi^+ \supseteq \{\psi^+_e \cup \psi^+_d\}$. The training objectives of the two encoder-decoders in CAG combine these reconstruction losses with the adversarial losses of the two discriminators, where $\mathbb{E}_{\bullet}$ estimates the expectation over samples from the $\bullet$ distribution. Both the minority and majority decoders run in an autoregressive way to generate tokens. Taking the generation of an anchor instance as an example, the anchor instance is generated as a sequence of $l$ tokens $y^+ = (y^+_1, \dots, y^+_l)$, where each token is conditioned on the previously generated tokens.
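The autoregressive generation that both decoders perform can be illustrated with a minimal greedy decoding loop. The step function below is a purely hypothetical stand-in for a real trained decoder, but it shows how each token is conditioned on the previously generated prefix.

```python
import numpy as np

def greedy_decode(step_fn, start_id, end_id, max_len):
    """Autoregressive decoding sketch: each token is conditioned on the
    prefix generated so far, as both CAG decoders do.

    step_fn(prefix) -> logits over the vocabulary (hypothetical model).
    """
    tokens = [start_id]
    for _ in range(max_len):
        next_id = int(np.argmax(step_fn(tokens)))
        tokens.append(next_id)
        if next_id == end_id:
            break
    return tokens[1:]

# Tiny hypothetical step function: always continues the sequence 1, 2, 3, <eos>.
VOCAB = 5  # token ids 0..4, with 4 acting as <eos>
def toy_step(prefix):
    logits = np.full(VOCAB, -1e9)
    logits[min(prefix[-1] + 1, 4)] = 0.0
    return logits

assert greedy_decode(toy_step, start_id=0, end_id=4, max_len=10) == [1, 2, 3, 4]
```

In MISO the logits would come from the trained minority (or majority) decoder conditioned additionally on the disentangled representation $z$; the loop structure is the same.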

Training and Inference
Training Objective In summary, the goal of MISO is to use CAG to generate anchor instances by constructing $Z$ as disentangled semantic representations with SFM jointly with the MI loss, $S_\sigma \circ T_\omega$. The final objective is defined as follows:

$$\mathcal{L}(\psi, \sigma, \omega, \phi) = \mathcal{L}_{CAG} + \lambda\,\mathcal{L}_{MI},$$

where $\lambda$ (derived from $\tau$; see the definition of $\tau$ in the SFM module) is a parameter that controls the contributions of the MI loss and the reconstruction loss.
Training In order to calculate the mutual information, the two MINEs need to be pre-trained before training the entire MISO. Furthermore, warming up the minority encoder is a necessary precondition for pre-training the MINEs, ensuring that their inputs are reliable. Therefore, we first freeze SFM and the discriminators to pre-train the minority encoder-decoder. Secondly, we follow the method proposed by Belghazi et al. (2018) to pre-train the MINEs. We found that the training challenge lies in how to pre-train the MINEs while SFM is frozen. To solve this, we simulate the output of SFM: we concatenate $K$ and $Q$ to obtain a set of $2d$-dimensional vectors and feed them into a feedforward neural network to obtain a set of $d$-dimensional vectors $\tilde{Q}$. We use $\tilde{Q}$ as the input of the decoders to participate in the pre-training of the encoder-decoders. Finally, we use $(Q, \tilde{Q}, K)$ as the inputs to pre-train the MINEs. The whole training process is shown in Algorithm 1 (lines 1-9).
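The quantity that the pre-trained MINEs estimate, the Donsker-Varadhan lower bound on mutual information, can be checked numerically with a fixed toy statistics network. This is an illustrative sketch, not MISO's trained estimator: `tanh(x*y)` stands in for the learned network $T_\omega$, and shuffling one variable simulates sampling from the product of marginals.

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound used by MINE:
    I(X;Y) >= E_joint[T] - log E_marginal[exp(T)]."""
    return float(t_joint.mean() - np.log(np.exp(t_marginal).mean()))

# Toy statistics network T(x, y) = tanh(x * y), evaluated on dependent
# (joint) pairs versus shuffled (product-of-marginals) pairs.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + 0.1 * rng.normal(size=5000)          # y strongly depends on x
t_joint = np.tanh(x * y)                     # pairs from the joint
t_marg = np.tanh(x * rng.permutation(y))     # shuffling destroys dependence
mi_estimate = dv_lower_bound(t_joint, t_marg)
assert mi_estimate > 0.0                     # dependence is detected
```

In MISO, $T_\omega$ is itself a trained network and the bound is tightened by gradient ascent on $\omega$; the fixed `tanh` statistic here only demonstrates that the bound is positive under dependence.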
Inference Once MISO is trained, we can use it to generate anchor instances for the difficult samples of the minority class. Lines 10-12 in Algorithm 1 demonstrate inference with MISO; the oversampling of anchor instances does not stop until the two classes are balanced. The relevant steps of Algorithm 1 are:

7: Update the parameters $\psi$ of the two encoder-decoders and $\sigma$ of SFM by descending their stochastic gradient $\nabla_{\psi, \sigma}\,\mathcal{L}(\psi, \sigma, \omega, \phi)$;
8: Update the parameters $\omega$ of the MINEs by descending their stochastic gradient $\nabla_{\omega}\,\mathcal{L}(\omega, \psi)$;
9: end
10: while $n^+ + t \leq n^-$ do
11: &nbsp;&nbsp;Generate $Y^+$ with $E_{\psi^+}$;
12: end
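The inference loop (lines 10-12) amounts to generating anchors until the class counts match. It can be sketched as follows, where `generate_anchor` is a hypothetical stand-in for decoding with the trained minority encoder-decoder:

```python
import random

def oversample_until_balanced(minority, majority, generate_anchor):
    """Sketch of Algorithm 1, lines 10-12: keep generating anchor
    instances until the two classes are balanced."""
    anchors = []
    while len(minority) + len(anchors) < len(majority):
        seed = random.choice(minority)        # a (difficult) minority sample
        anchors.append(generate_anchor(seed))
    return anchors

random.seed(0)
minority = ["neg review 1", "neg review 2"]
majority = ["pos"] * 7
anchors = oversample_until_balanced(minority, majority, lambda s: s + " [anchor]")
assert len(minority) + len(anchors) == len(majority)
```

The generated anchors are then merged with the original minority data before the downstream classifier is trained.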

Experiments
We conducted experiments on several text classification tasks to examine the effectiveness of the proposed MISO against previous state-of-the-art imbalanced learning methods.

Baselines
• Focal and Dice loss (re-weighting methods). Lin et al. (2017), among others, introduce algorithms that learn difficult samples by adjusting the weights of instances.
• MixText (data augmentation method). Chen et al. (2020) build a semi-supervised learning model by interpolating texts in the hidden space.
• ADASYN (re-sampling method). He et al. (2008) propose an adaptive synthetic method that generates new examples for each minority instance according to the data distribution.
• τ -norm and cRT (long-tailed learning methods). Kang et al. (2020) decouple representation learning and classification so as to train the classifier to balance the decision boundary independently.
Note that we have chosen three types of classification models (i.e., TextCNN (Kim, 2014), TextRNN (Liu et al., 2016), and XLNet (Yang et al., 2019)) to combine with MISO for the complete text classification task. The backbone networks of these models are CNN, RNN, and Transformer, respectively. We hence term their combinations with MISO M-CNN, M-RNN, and M-XLNet.
Datasets In Table 2, we have selected 6 datasets: 3 imbalanced and 3 balanced. Following Ger and Klabjan (2019), we converted the balanced datasets into imbalanced ones by randomly downsampling one of the classes to 1% and 5% in each experiment, a common practice in imbalanced learning. Concretely, Opin-Rank contains hotel reviews from TripAdvisor and car reviews from Edmunds (Ganesan and Zhai, 2012). SMS Spam was created via the Short Message Service (SMS) (Peng et al., 2019). Toutiao is a Chinese dataset containing 15 topics (Ouyang et al., 2020). Yelp.P contains Yelp reviews about the best restaurants, shopping, nightlife, food, and entertainment (Li et al., 2018). IMDB is a movie review dataset (Ger and Klabjan, 2019). AG_News consists of news articles from the AG's corpus (Yang et al., 2019). For the multi-class datasets (i.e., Toutiao and AG_News), we treat all data as majority samples except the data of the selected minority class.
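The construction of the 1% and 5% settings from a balanced dataset can be sketched as below; the function name and labels are illustrative, not the authors' exact preprocessing script.

```python
import random

def make_imbalanced(samples_by_class, minority_label, keep_ratio, seed=0):
    """Downsample one class of a balanced dataset to simulate imbalance,
    e.g. keep_ratio=0.01 for the 1% setting used in the experiments."""
    rng = random.Random(seed)
    out = {}
    for label, samples in samples_by_class.items():
        if label == minority_label:
            k = max(1, int(len(samples) * keep_ratio))
            out[label] = rng.sample(samples, k)   # random subsample
        else:
            out[label] = list(samples)            # majority kept intact
    return out

data = {"pos": [f"p{i}" for i in range(1000)],
        "neg": [f"n{i}" for i in range(1000)]}
skewed = make_imbalanced(data, "neg", keep_ratio=0.05)
assert len(skewed["neg"]) == 50 and len(skewed["pos"]) == 1000
```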
Experiment Settings All experimental results were obtained as the mean of 5-fold cross-validation. We set $b_s = 64$, the learning rate to $1 \times 10^{-4}$, $d = 64$, and $h = 8$. We removed stop words using the Baidu stop word list for the Chinese dataset and NLTK 3.5 stop words for the English datasets. We used "Jieba" for word segmentation on the Toutiao dataset.
Evaluation Metrics We adopted the F1 metric (Yan et al., 2019) to evaluate all models.
Results Table 3 summarizes the results of MISO against the other methods on each benchmark dataset. MISO achieves the best results on all datasets, suggesting that it is consistently effective across different data situations. The results show that ADASYN, a widely-used baseline for imbalanced learning, does not perform well on imbalanced text classification. The main reason is that the discrete nature of text leads ADASYN to improperly synthesize data that do not exist in the real world, which destroys the distribution of texts to some extent. The same problem appears in MixText. In contrast, MISO leverages CAG to keep the new distribution consistent with the original one.
Focal and Dice set larger learning weights for difficult samples. This is feasible when minority data are sufficient, but not in the Toutiao, IMDB (1%) and AG_News (1%) datasets. Since MISO supplies anchor instances for the minority class, an average improvement of 3.5% can still be obtained in the case of data sparseness.
Following Kang et al. (2020), we kept the backbone network (i.e., representation learning) frozen and fine-tuned the classifiers by class-balanced sampling (cRT) or decision boundary rectifying (τ-norm). Neither of them considers the impact of difficult samples on finding clear decision boundaries. In contrast, MISO outperforms them by an average of 2.7%, which explicitly illustrates the necessity of re-embedding difficult samples. In addition, MISO enables models based on CNN or RNN, without pre-training, to outperform XLNet on the Opin-Rank, SMS Spam, and IMDB datasets, thus saving training time and space.

Analysis
We carried out statistical and empirical analyses of the superiority of MISO for re-embedding difficult samples.

Difficult Sample Re-embedding
We counted the number $\gamma$ of majority samples among the $k$-nearest neighbors of each minority sample. If $\gamma > 0$, the corresponding minority sample is considered a difficult sample.
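This counting procedure can be sketched directly as a brute-force k-NN in embedding space; the embeddings below are synthetic stand-ins for encoder outputs.

```python
import numpy as np

def count_difficult(minority_emb, majority_emb, k=5):
    """Count minority samples with at least one majority sample among
    their k nearest neighbors (i.e., gamma > 0)."""
    all_emb = np.vstack([minority_emb, majority_emb])
    is_majority = np.arange(len(all_emb)) >= len(minority_emb)
    difficult = 0
    for x in minority_emb:
        dists = np.linalg.norm(all_emb - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the sample itself
        gamma = int(is_majority[neighbors].sum())
        difficult += gamma > 0
    return difficult

rng = np.random.default_rng(0)
minority = rng.normal(loc=0.0, size=(20, 2))    # far from the majority
majority = rng.normal(loc=5.0, size=(200, 2))
overlap = rng.normal(loc=5.0, size=(5, 2))      # minority points in overlap
n_diff = count_difficult(np.vstack([minority, overlap]), majority)
assert 5 <= n_diff <= 25                        # at least the 5 overlap points
```

With real data, `minority_emb` and `majority_emb` would be the backbone network's embeddings of the two classes.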
Results Table 4 shows statistics on the change in difficult samples before and after MISO is applied to the different datasets. The average decrease in the percentage of difficult samples across all datasets is 13.7% after re-embedding. This illustrates that MISO can effectively transform the semantic representations of entangled difficult samples into non-difficult versions. Surprisingly, the F1-scores of difficult samples and non-difficult samples do not decrease, which suggests that re-embedding difficult samples and generating anchor instances do not make classifiers lose their ability to classify non-difficult samples. Intuitively, as long as the classification performance on difficult and non-difficult samples is maintained, turning some difficult samples into non-difficult ones inevitably improves the overall classification performance.

Ablation Study
As mentioned above, decoupling the entangled semantic representations of difficult samples from the majority class is achieved by SFM jointly with the MI loss. Therefore, to verify the effectiveness of this design, we conducted ablation experiments: using only CAG in the comparative experiments above (see Table 3) and in the analysis of difficult sample re-embedding (see Table 5).
Results Compared with the state-of-the-art methods, CAG alone shows an average performance drop of 1.1%. This indicates that SFM constrained by the MI loss effectively improves the overall performance of classifiers, by an average of 3.8%. Merely using CAG to generate new texts amounts to sample-balanced sampling, and it shares with cRT the property of ignoring the decoupling of difficult samples. As shown in Table 5, the percentage of difficult samples only decreases by an average of 1.1%. In Yelp.P (5%) and AG_News (1%), the percentages of difficult samples even increase by 0.2% and 0.6%, respectively, so the newly added difficult samples inevitably make the classification boundary harder to capture. In addition, based on the experimental results of Kang et al. (2020), the effects of class-balanced sampling and decision boundary rectifying are slightly better than that of sample-balanced sampling, which is the main reason that CAG is less effective than the state-of-the-art methods. It is important to note that CAG also degrades the classification performance, especially for non-difficult samples in the SMS Spam, Toutiao, Yelp.P (1%) and AG_News (1%) datasets (see Table 5); however, this issue does not appear in MISO (see Table 4).

Table 6 (excerpts of generated anchor instances):
- SMS Spam: "Just text ok to us and we'll credit your account." → "Just text ok to us that guarantee your bonus."
- Opin-Rank: "The sonata has a very smooth ride and great pick." → "The sonata has ample acceleration because of our window setting."
- Yelp.P: "These guys were rude and I really have a disappointing meal."
- IMDB: "A incredible story about a man who wants to figure out what really happened ..."
- AG_News: "The manager said he left North London because he can not control recruitment." → "The manager of the North London football club will be banned for the next seven years."
- CAG (Ours): 1. "I was worst ridiculous by worst restaurant." 2. "The complimentary worst and oil was worst." 3. "I angry their thing worst." 4. "Its a bit session for us but worst once in a while."

Case Study
We looked into our data to investigate how MISO generates anchor instances.
Results As shown in Table 6, because of the imbalance problem, when all tokens in the spam SMS "Just text ok to us and we will credit your account" frequently appear in non-spam SMS, the classifier is misled into categorizing this sentence as non-spam. To solve this issue, MISO generates an anchor instance by adding tokens such as "guarantee" and "bonus". The anchor instance makes the backbone network re-embed this difficult sample in a non-overlapping form, which is more likely to be correctly classified as spam. A more interesting example appears in Yelp.P, where MISO seems to learn the semantic entailment of the original difficult sample, i.e., "it was late" entails "guys are slow". These examples suggest that MISO is able to learn the underlying meaning of difficult samples and generate new samples that preserve it. We also conducted the CAG ablation for the case study. Due to space limitations, we only show the Yelp.P (1%) example in Table 6. The repeated token "worst" in instances generated by CAG is usually meaningless, which reflects that without SFM and the MI loss, CAG learns only extremely limited semantics.

Conclusion
In this paper, we have presented an effective mutual information-constrained oversampling strategy that re-embeds difficult samples in a safe and robust way. Our method keeps traditional text classification feasible when dealing with imbalanced data in the real world. In future work, we will design a more effective backbone network for re-embedding difficult samples.

Appendix Background
Mutual Information In probability theory and information theory, mutual information (MI) measures the interdependence between two random variables. It can be used to estimate the similarity between the joint distribution and the product of the marginal distributions. For convenience, we abbreviate $P_{x \sim X}(x)$ as $P(X)$, where $X$ denotes any distribution of $x \in \mathcal{X}$. The entropy of a distribution $X$ can be defined as

$$H(X) = -\sum_{x \in \mathcal{X}} P(x) \log P(x). \quad (1)$$

Given two probability distributions $A$ and $B$ taking values from finite sets $\mathcal{A}$ and $\mathcal{B}$ respectively, with $x \in \mathcal{A}$ and $y \in \mathcal{B}$, the conditional entropy of $A$ given $B$ can be defined as

$$H(A|B) = -\sum_{x \in \mathcal{A}} \sum_{y \in \mathcal{B}} P_{AB}(x, y) \log P(x|y). \quad (2)$$

We formalize MI from the perspective of probability theory. We define the joint distribution of $A$ and $B$ as $P_{AB}(x, y)$. Then, the discrete probability version of the mutual information can be formalized as

$$I(A; B) = \sum_{x \in \mathcal{A}} \sum_{y \in \mathcal{B}} P_{AB}(x, y) \log \frac{P_{AB}(x, y)}{P_A(x) P_B(y)}. \quad (3)$$

Specifically, the mutual information between $A$ and $B$ is the reduction in the uncertainty of $A$ due to the knowledge of $B$ (or vice versa). Therefore, it can also be defined as

$$I(A; B) = H(A) - H(A|B) = H(B) - H(B|A).$$

According to the above definition, mutual information satisfies the following properties:
• Non-negativity (i.e., $I(A; B) \geq 0$);
• Symmetry (i.e., $I(A; B) = I(B; A)$).
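For a discrete joint distribution given as a probability table, the mutual information in Eq. (3) can be computed directly. The sketch below also confirms two boundary cases: independent variables give $I = 0$, and identical binary variables give $I = H(A) = 1$ bit.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) = sum_xy P(x,y) log2( P(x,y) / (P(x) P(y)) ) for a joint
    probability table (rows: values of A, columns: values of B)."""
    pa = joint.sum(axis=1, keepdims=True)     # marginal P(A)
    pb = joint.sum(axis=0, keepdims=True)     # marginal P(B)
    mask = joint > 0                          # 0 log 0 := 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])        # A, B independent -> I = 0
identical = np.array([[0.5, 0.0],
                      [0.0, 0.5]])            # B = A -> I = H(A) = 1 bit
assert abs(mutual_information(independent)) < 1e-12
assert abs(mutual_information(identical) - 1.0) < 1e-12
```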
In addition, its extremum property is $I(A; B) \leq \log |\mathcal{A}|$, where $|\mathcal{A}|$ is the size of the set $\mathcal{A}$.
Kullback-Leibler (KL) Divergence KL divergence can be viewed as a measure of the "distance" or "dissimilarity" between distributions $A$ and $B$ defined over a common alphabet $\mathcal{X}$, written $D(A \,\|\, B)$. It measures the inefficiency of mistakenly assuming that the distribution of a source is $B$ when the true distribution is $A$. KL divergence is defined by

$$D(A \,\|\, B) = \sum_{x \in \mathcal{X}} P_A(x) \log \frac{P_A(x)}{P_B(x)}.$$

KL divergence satisfies:
• Non-negativity (i.e., $D(A \,\|\, B) \geq 0$);
• Asymmetry (i.e., $D(A \,\|\, B) \neq D(B \,\|\, A)$ in general).
Variational Distance The variational distance (also known as the $L_1$-distance) between two distributions $A$ and $B$ over $\mathcal{X}$ is defined by

$$\|A - B\| = \sum_{x \in \mathcal{X}} |P_A(x) - P_B(x)|.$$

Thus, it satisfies:
• Non-negativity (i.e., $\|A - B\| \geq 0$);
• Symmetry (i.e., $\|A - B\| = \|B - A\|$).
In addition, the variational distance and KL divergence satisfy

$$D(A \,\|\, B) \geq \frac{1}{2 \ln 2}\, \|A - B\|^2,$$

which is referred to as Pinsker's inequality.
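Pinsker's inequality can be sanity-checked numerically on random distributions, using logarithms in base 2 so the bound takes the form $D(A\|B) \geq \|A - B\|^2 / (2 \ln 2)$:

```python
import numpy as np

def kl(a, b):
    return float((a * np.log2(a / b)).sum())     # D(A||B) in bits

def variational(a, b):
    return float(np.abs(a - b).sum())            # L1 distance

rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.dirichlet(np.ones(4))                # random strictly positive pmfs
    b = rng.dirichlet(np.ones(4))
    # Pinsker's inequality (base-2 logs): D(A||B) >= ||A - B||^2 / (2 ln 2)
    assert kl(a, b) >= variational(a, b) ** 2 / (2 * np.log(2)) - 1e-12
```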
Learning the (Near-)optimal Boundary of the Minority Class Observe that the adversarial objective with respect to the discriminator $D$ can be written as

$$\Delta = \int_x \left[ p_{N^+}(x) \log D(x) + p_U(x) \log (1 - D(x)) \right] dx.$$

Therefore, $\Delta$ is concave with respect to the discriminator $D$. When $\frac{d}{dD}\Delta = 0$, $\Delta$ is maximized. In other words, the optimal $D^*$ is computed as

$$D^*(x) = \frac{p_{N^+}(x)}{p_{N^+}(x) + p_U(x)}.$$
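The claim that the pointwise objective $p \log D + q \log(1 - D)$ is maximized at $D^* = p/(p + q)$ can be verified numerically for fixed densities $p$ and $q$:

```python
import numpy as np

def delta(D, p, q):
    """Pointwise discriminator objective: p log D + q log(1 - D)."""
    return p * np.log(D) + q * np.log(1.0 - D)

p, q = 0.7, 0.2                         # densities of the two distributions at x
D = np.linspace(1e-4, 1 - 1e-4, 100000) # fine grid over admissible D values
best = D[np.argmax(delta(D, p, q))]
assert abs(best - p / (p + q)) < 1e-3   # argmax lands at p / (p + q)
```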

Distribution Consistency
Theorem 2 Under the same conditions as Theorem 1, the distribution captured by MISO is consistent with the original distribution of the minority class, i.e., $P(Q) - P(\tilde{Q}) = 0$ as $P(K|\tilde{Q}) \to P(K)$ and $P(Q|\tilde{Q}) \to 1$.
Proof. First, by combining the definition of conditional probability with the classification problem, we obtain that minority and majority samples are mutually exclusive and independent of each other, that is, $P(QK) = P(Q)P(K)$; then we have $P(Q|K) = \frac{P(QK)}{P(K)} = P(Q)$.
Furthermore, by the fact that conditioning never decreases divergence, we have, for all $x, y, z \in \tilde{\mathcal{Q}}$, the corresponding conditional bound. This, along with (15) and (16), bounds $D(\mathcal{Q} \,\|\, \tilde{\mathcal{Q}})$ from above. From Theorem 1, when the target state is reached, it satisfies $P(Q|\tilde{Q}) = 1$ and $P(K|\tilde{Q}) = P(K)$.
This, together with (17), bounds the divergence from above. On the other hand, by the non-negativity of KL divergence, we know that $D(\mathcal{Q} \,\|\, \tilde{\mathcal{Q}}) \geq 0$.
From the Squeeze Theorem, it follows that $D(\mathcal{Q} \,\|\, \tilde{\mathcal{Q}}) = 0$ as $P(K|\tilde{Q}) \to P(K)$ and $P(Q|\tilde{Q}) \to 1$.
In summary, the distribution captured by MISO is consistent with the prior distribution of the minority class.