Semi-Supervised Text Classification with Balanced Deep Representation Distributions

Semi-Supervised Text Classification (SSTC) methods mainly work in the spirit of self-training: they initialize a deep classifier by training over labeled texts, and then alternately predict pseudo-labels for unlabeled texts and retrain the classifier over the mixture of labeled and pseudo-labeled texts. Naturally, their performance largely depends on the accuracy of the pseudo-labels for unlabeled texts. Unfortunately, this accuracy is often low because of the margin bias problem, caused by the large difference between the representation distributions of labels in SSTC. To alleviate this problem, we apply the angular margin loss and perform Gaussian linear transformations to achieve balanced label angle variances, i.e., the variances of the angles of texts within the same label. Constraining all label angle variances to be balanced, where the variances are estimated over both labeled and pseudo-labeled texts during the self-training loops, yields more accurate pseudo-labels. With this insight, we propose a novel SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). To evaluate S2TC-BDD, we compare it against state-of-the-art SSTC methods. Empirical results demonstrate the effectiveness of S2TC-BDD, especially when labeled texts are scarce.


Introduction
Semi-Supervised Learning (SSL) refers to the paradigm of learning with labeled as well as unlabeled data to perform certain applications (van Engelen and Hoos, 2020). In particular, developing effective SSL models for classifying text data has long been a goal of natural language processing research, because labeled texts are difficult to collect in many real-world scenarios. Formally, this research topic is termed Semi-Supervised Text Classification (SSTC), which nowadays draws much attention from the community (Gururangan et al., 2019; Chen et al., 2020).
To our knowledge, the most recent SSTC methods mainly borrow ideas from the successful patterns of supervised deep learning, such as pre-training and fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Peters et al., 2018; Gururangan et al., 2019; Devlin et al., 2019). Generally, these methods perform deep representation learning on unlabeled texts followed by supervised learning on labeled texts. However, a drawback is that they learn from the labeled and unlabeled texts separately; in particular, the deep representations are trained without using the labeling information, resulting in potentially less discriminative representations as well as worse performance.
To avoid this problem, other SSTC methods combine the traditional spirit of self-training with deep learning, jointly learning the deep representation and the classifier using both labeled and unlabeled texts in a unified framework (Miyato et al., 2017, 2019; Sachan et al., 2019; Xie et al., 2020; Chen et al., 2020). Specifically, this kind of method initializes a deep classifier, e.g., BERT (Devlin et al., 2019) with the Angular Margin (AM) loss (Wang et al., 2018), by training over labeled texts only; it then alternately predicts pseudo-labels for unlabeled texts and retrains the deep classifier over the mixture of labeled and pseudo-labeled texts. Accordingly, both labeled and unlabeled texts directly contribute to the training of the deep classifier. Generally speaking, for deep self-training methods one significant factor of performance is the accuracy of the pseudo-labels for unlabeled texts. Unfortunately, they often suffer from low accuracy, and one major reason is the margin bias problem. To interpret this problem, we examine the AM loss with respect to the label angles, i.e., the angles between the deep representations of texts and the weight vectors of labels. For unlabeled texts, the pseudo-labels are predicted by ranking the label angles only, neglecting the difference between label angle variances, i.e., the variances of the label angles of texts within the same label, which can be very large in SSL, as illustrated in Fig.1. In this context, the decision boundary of the AM loss is actually not the optimal one, potentially resulting in lower accuracy of the pseudo-labels (see Fig.2(a)).
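To make the margin bias concrete, consider a tiny numeric sketch (all numbers here are invented for illustration): a text of the wide-variance label sits slightly farther, in raw angle, from its own prototype than from the tight label's prototype, so ranking raw angles misclassifies it, while standardizing each angle by its label's statistics does not.

```python
import numpy as np

# Invented per-label angle statistics: label 0 is tight, label 1 is spread out.
mu = np.array([0.6, 0.6])       # mean label angles (radians)
sigma = np.array([0.05, 0.30])  # label angle standard deviations

# A text whose true label is 1, with angles to the label-0 / label-1 prototypes.
theta = np.array([1.00, 1.10])

pred_raw = int(np.argmin(theta))   # ranking raw angles -> label 0 (wrong)
z = (theta - mu) / sigma           # variance-aware (standardized) angles
pred_balanced = int(np.argmin(z))  # -> label 1 (right)
print(pred_raw, pred_balanced)     # prints: 0 1
```

The standardized angle is 8 standard deviations from label 0's mean but under 2 from label 1's, which is exactly the information a raw-angle ranking throws away.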
To alleviate the aforementioned problem, we propose a novel SSTC method built on BERT with the AM loss, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). Specifically, in S2TC-BDD we suppose that the label angles are drawn from label-specific Gaussian distributions. For each text, we can therefore apply linear transformations to balance the label angle variances. This is equivalent to moving the decision boundary to the optimal one, eliminating the margin bias (see examples in Fig.2(b)). We estimate each label angle variance over both labeled and pseudo-labeled texts during the self-training loops. We evaluate the proposed S2TC-BDD method by comparing it against the most recent deep SSTC methods. Experimental results indicate the superior performance of S2TC-BDD, even with very few labeled texts.

Related Work
The pre-training and fine-tuning framework has lately shown impressive effectiveness on a variety of tasks (Dai and Le, 2015; Radford et al., 2019a; Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019; Chen et al., 2019; Akbik et al., 2019; Radford et al., 2019b; Brown et al., 2020; Chen et al., 2020). These methods mainly perform deep representation learning on generic data, followed by supervised learning for downstream tasks. Several SSTC methods are built on this framework (Dai and Le, 2015; Howard and Ruder, 2018; Peters et al., 2018; Gururangan et al., 2019; Devlin et al., 2019). For instance, the VAriational Methods for Pretraining In Resource-limited Environments (VAMPIRE) (Gururangan et al., 2019) first pre-trains a Variational Auto-Encoder (VAE) on unlabeled texts, and then trains a classifier on the augmented representations of labeled texts computed by the pre-trained VAE. However, the VAE is trained without using the labeling information, resulting in potentially less discriminative representations of labeled texts.

[Figure 2: Solid circles and triangles denote labeled positive and negative texts; hollow ones denote the corresponding unlabeled texts. (a) A large difference between label angle variances results in margin bias: many unlabeled texts (in red) are misclassified. (b) Balancing the label angle variances eliminates the margin bias. Best viewed in color.]
Recent works on SSTC mainly focus on deep self-training (Miyato et al., 2017; Sachan et al., 2019; Miyato et al., 2019; Xie et al., 2020; Chen et al., 2020), which jointly learns the deep representation and the classifier using both labeled and unlabeled texts in a unified framework. It is implemented as an alternating process, in which the pseudo-labels of unlabeled texts are updated by the current deep classifier, and the deep classifier is then retrained over both labeled and pseudo-labeled texts. For example, the Virtual Adversarial Training (VAT) method (Miyato et al., 2017, 2019) follows the philosophy of making the classifier robust against random and local perturbations. It first generates predictions for the original texts with the current deep classifier, and then trains the classifier with a consistency loss between these predictions and the classifier's outputs on noised texts, obtained by applying local perturbations to the embeddings of the original texts. Further, the work of Sachan et al. (2019) combines maximum likelihood, adversarial training, virtual adversarial training, and entropy minimization in a unified objective. Furthermore, rather than applying local perturbations, Unsupervised Data Augmentation (UDA) (Xie et al., 2020) employs a consistency loss between the predictions of unlabeled texts and their augmented counterparts, produced by data augmentation techniques such as back-translation and tf-idf word replacement. Another line of work exploits cross-view training, matching the predictions of auxiliary prediction modules over restricted views of unlabeled texts (e.g., only part of a sentence) with those of the primary prediction module over the corresponding full views.
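The consistency losses used by VAT and UDA share a simple core: penalize divergence between the model's predictions on an input and on its perturbed or augmented version. A minimal hedged sketch (the KL direction and the function name are our choices, not taken from those papers):

```python
import numpy as np

def consistency_loss(p_orig, p_pert, eps=1e-12):
    # KL(p_orig || p_pert): zero when the two prediction distributions agree,
    # and growing as the perturbed prediction drifts from the original one.
    return float((p_orig * (np.log(p_orig + eps) - np.log(p_pert + eps))).sum())

p = np.array([0.7, 0.3])   # prediction on the original text
q = np.array([0.6, 0.4])   # prediction on the perturbed/augmented text
loss = consistency_loss(p, q)
```

Minimizing this term over unlabeled texts pushes the classifier to make the same prediction under perturbation, which is how these methods extract a training signal without labels.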
Orthogonal to the aforementioned self-training SSTC methods, our S2TC-BDD further addresses the margin bias problem by balancing the label angle variances. This yields more accurate pseudo-labels for unlabeled texts, and thus boosts the performance on SSTC tasks.

The Proposed S2TC-BDD Method
In this section, we describe the proposed deep self-training SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD).
Formulation of SSTC Consider a training dataset D consisting of a limited labeled text set D_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l} and a large unlabeled text set D_u = \{x_j^u\}_{j=1}^{N_u}. Specifically, let x_i^l and x_j^u denote the word sequences of labeled and unlabeled texts, respectively; and let y_i^l \in \{0, 1\}^K denote the corresponding one-hot label vector of x_i^l, where y_{ik}^l = 1 if the text is associated with the k-th label and y_{ik}^l = 0 otherwise. N_l, N_u, and K denote the numbers of labeled texts, unlabeled texts, and category labels, respectively. In this paper, we focus on the paradigm of inductive SSTC, whose goal is to learn a classifier from the training dataset D with both labeled and unlabeled texts. The important notations are described in Table 1.

Overview of S2TC-BDD
[Table 1: Important notations. x^l: word sequence of a labeled text in D_l; x^u: word sequence of an unlabeled text in D_u; y^l \in \{0, 1\}^K: one-hot label vector of a labeled text.]

Overall, our S2TC-BDD performs a self-training procedure for SSTC. Given a training dataset, it first fine-tunes a deep classifier based on the pre-trained BERT model (Devlin et al., 2019) with the AM loss (Wang et al., 2018). During the self-training loops, we employ the current deep classifier to predict pseudo-labels for unlabeled texts, and then update it over both labeled and pseudo-labeled texts. In particular, we develop a Balanced Deep representation Distribution (BDD) loss, aiming at more accurate pseudo-labels for unlabeled texts. The overall framework of S2TC-BDD is shown in Fig.3. We now present the important details of S2TC-BDD.
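The self-training loop just described can be sketched end-to-end with a toy stand-in for the deep classifier. Here a nearest-prototype classifier on synthetic 2-D points plays the role of BERT, and all data and hyperparameters are invented for illustration:

```python
import numpy as np

# Toy data: two well-separated Gaussian clusters standing in for two labels.
rng = np.random.default_rng(1)
X_l = np.vstack([rng.normal(-2, 0.5, (10, 2)), rng.normal(2, 0.5, (10, 2))])
y_l = np.array([0] * 10 + [1] * 10)
X_u = np.vstack([rng.normal(-2, 0.8, (50, 2)), rng.normal(2, 0.8, (50, 2))])

def fit_prototypes(X, y, k=2):
    # "Train" the toy classifier: one prototype per label (class mean).
    return np.vstack([X[y == c].mean(axis=0) for c in range(k)])

def predict(protos, X):
    # Assign each point to its nearest prototype.
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)

protos = fit_prototypes(X_l, y_l)   # initialize on labeled texts only
for _ in range(5):                  # self-training loops
    y_u = predict(protos, X_u)      # predict pseudo-labels for unlabeled texts
    X_mix = np.vstack([X_l, X_u])   # retrain on labeled + pseudo-labeled mixture
    y_mix = np.concatenate([y_l, y_u])
    protos = fit_prototypes(X_mix, y_mix)

acc = float((predict(protos, X_l) == y_l).mean())
print("labeled-set accuracy:", acc)
```

The structure (initialize on labeled data, then alternate pseudo-labeling and retraining) is the point here; S2TC-BDD replaces the toy classifier with BERT plus the BDD loss below.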
BDD Loss Formally, our BDD loss extends the AM loss (Wang et al., 2018). For clarity, we first describe the AM loss with respect to angles. Given a training example (x_i, y_i), it can be formulated as:

\mathcal{L}_{am}(x_i, y_i; \phi) = -\sum_{k=1}^{K} y_{ik} \log \frac{e^{s(\cos\theta_{ik} - m)}}{e^{s(\cos\theta_{ik} - m)} + \sum_{k' \neq k} e^{s\cos\theta_{ik'}}}, \quad \cos\theta_{ik} = \frac{W_k^{\top} f_i}{\|W_k\|_2 \|f_i\|_2}, \quad (1)

where \phi denotes the model parameters; \|\cdot\|_2 is the \ell_2-norm of vectors; f_i and W_k denote the deep representation of text x_i and the weight vector of label k, respectively; \theta_{ik} is the angle between f_i and W_k; and s and m are the parameters controlling the rescaled norm and the magnitude of the cosine margin, respectively.
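The AM loss of Eq.1 can be sketched in a few lines of NumPy. This is a hedged sketch for a single example: the function name and the values of s and m below are ours, not prescribed by the paper.

```python
import numpy as np

def am_loss(theta, y, s=30.0, m=0.35):
    """AM loss for one example (s, m chosen for illustration).

    theta: (K,) angles between the text representation and each label vector.
    y:     (K,) one-hot label vector.
    """
    cos = np.cos(theta)
    logits = s * (cos - m * y)  # subtract the margin m only for the true label
    logits -= logits.max()      # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return float(-(y * np.log(p)).sum())
```

As expected, the loss shrinks as the angle to the true label's weight vector shrinks, and subtracting m for the true label forces that angle to beat the others by a margin.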
Reviewing Eq.1, we observe that it measures the loss only through the label angles of texts. We argue that this corresponds to a non-optimal decision boundary in SSTC, where the difference between label angle variances is much larger than in supervised learning. To alleviate this problem, we suppose that the label angles are drawn from label-specific Gaussian distributions \{\mathcal{N}(\mu_k, \sigma_k^2)\}_{k=1}^{K}. Thanks to the properties of the Gaussian distribution, we can easily transfer them into distributions with balanced variances by applying the following linear transformations to the angles:

\psi_k(\theta) = \alpha_k \theta + \beta_k, \quad k = 1, \ldots, K, \quad (2)

where

\alpha_k = \frac{\sigma}{\sigma_k}, \quad \beta_k = \left(1 - \frac{\sigma}{\sigma_k}\right)\mu_k, \quad (3)

and \sigma^2 is the shared balanced variance, e.g., the average \frac{1}{K}\sum_{k=1}^{K}\sigma_k^2. With these linear transformations \{\psi_k(\cdot)\}_{k=1}^{K}, all angles become samples from balanced angular distributions with the same variance, i.e., \psi_k(\theta_{ik}) \sim \mathcal{N}(\mu_k, \sigma^2). Accordingly, the angular loss of Eq.1 can be rewritten as the following BDD loss:

\mathcal{L}_{bdd}(x_i, y_i; \phi) = -\sum_{k=1}^{K} y_{ik} \log \frac{e^{s(\cos\psi_k(\theta_{ik}) - m)}}{e^{s(\cos\psi_k(\theta_{ik}) - m)} + \sum_{k' \neq k} e^{s\cos\psi_{k'}(\theta_{ik'})}}. \quad (4)

Supervised Angular Loss Applying the BDD loss \mathcal{L}_{bdd} of Eq.4 to the labeled text set D_l = \{(x_i^l, y_i^l)\}_{i=1}^{N_l}, we can formulate the following supervised angular loss:

\mathcal{L}_l = \frac{1}{N_l}\sum_{i=1}^{N_l} \mathcal{L}_{bdd}(x_i^l, y_i^l; \phi). \quad (5)

Unsupervised Angular Loss Under the self-training paradigm, we form the loss over unlabeled texts and their pseudo-labels. Specifically, we take the pseudo-label to be the output probability of the deep classifier, computed by normalizing \{\cos(\psi_k(\theta_{ik}))\}_{k=1}^{K} with the softmax function:

p(k \mid x_i, \phi) = \frac{e^{\cos(\psi_k(\theta_{ik}))}}{\sum_{k'=1}^{K} e^{\cos(\psi_{k'}(\theta_{ik'}))}}. \quad (6)

For each unlabeled text x_j^u, the pseudo-label distribution y_j^u is given by p(\cdot \mid x_j^u, \hat{\phi}) with the fixed copy \hat{\phi} of the current model parameters \phi during the self-training loops. Besides, to keep the pseudo-label distributions \{y_j^u\}_{j=1}^{N_u} from becoming too uniform, we apply a sharpening function with temperature T over them:

\mathrm{Sharpen}(y, T)_k = \frac{y_k^{1/T}}{\|y^{1/T}\|_1}, \quad (7)

where \|\cdot\|_1 is the \ell_1-norm of vectors. As T \to 0, the pseudo-label distribution tends to a one-hot vector.
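The two operations just described, the per-label linear transformation and the temperature sharpening, admit a direct NumPy sketch. Function and variable names are ours; sigma is the shared balanced standard deviation the transformations map every label onto.

```python
import numpy as np

def balance(theta, mu_k, sigma_k, sigma):
    # Per-label linear transformation: keeps the label's mean angle mu_k but
    # rescales the spread around it from sigma_k to the shared sigma.
    return (sigma / sigma_k) * (theta - mu_k) + mu_k

def sharpen(y, T=0.5):
    # Temperature sharpening of a pseudo-label distribution: raises each
    # probability to 1/T and renormalizes, concentrating mass as T -> 0.
    p = y ** (1.0 / T)
    return p / p.sum()
```

A quick sanity check: an angle one old standard deviation from the mean is mapped to one new standard deviation from the mean, so a Gaussian angle distribution keeps its mean while its variance becomes the shared one.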
Applying the BDD loss of Eq.4 to the unlabeled text set D_u = \{x_j^u\}_{j=1}^{N_u} and the pseudo-label distributions \{y_j^u\}_{j=1}^{N_u}, we can formulate the following unsupervised angular loss:

\mathcal{L}_u = \frac{1}{N_u}\sum_{j=1}^{N_u} \mathcal{L}_{bdd}(x_j^u, y_j^u; \phi). \quad (8)

Entropy Regularization Further, we employ the conditional entropy of p(y \mid x_i, \phi) as an additional regularization term:

\mathcal{L}_e = -\sum_{k=1}^{K} p(k \mid x_i, \phi)\log p(k \mid x_i, \phi). \quad (9)

This conditional entropy regularization was introduced by Grandvalet and Bengio (2004), and is also utilized in (Sajjadi et al., 2016; Miyato et al., 2019; Sachan et al., 2019). It likewise sharpens the output probability of the deep classifier.
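The entropy term is small enough to write out in full (a hedged sketch; the function name and epsilon guard are ours):

```python
import numpy as np

def entropy_reg(p, eps=1e-12):
    # Conditional-entropy penalty on a predicted distribution p over K labels:
    # high for uniform (uncertain) outputs, low for confident ones, so
    # minimizing it pushes the classifier toward sharp predictions.
    return float(-(p * np.log(p + eps)).sum())

uniform = np.array([0.5, 0.5])
peaked = np.array([0.99, 0.01])
```

Minimizing this term over the classifier's outputs pushes probability mass toward a single label, complementing the explicit sharpening applied to pseudo-labels.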

Implementations of Label Angle Variances
In this section, we describe the implementations of the label angle variances. As mentioned before, our concern is estimating the angular distributions \{\mathcal{N}(\mu_k, \sigma_k^2)\}_{k=1}^{K}, whose draws are the angles between the deep representations of texts and the label prototypes \{c_k\}_{k=1}^{K}. Both \{(\mu_k, \sigma_k^2)\}_{k=1}^{K} and \{c_k\}_{k=1}^{K} are estimated over both labeled and pseudo-labeled texts during the self-training loops. In the following, we describe their learning processes in more detail.
Within the framework of stochastic optimization, we update \{(\mu_k, \sigma_k^2)\}_{k=1}^{K} and \{c_k\}_{k=1}^{K} per epoch. For convenience, we denote by \Omega the index set of labeled and unlabeled texts in one epoch, and by \{f_i\}_{i\in\Omega} and \{y_i\}_{i\in\Omega} the deep representations of texts and the corresponding label or pseudo-label vectors (i.e., y_i^l or y_i^u) in the current epoch, respectively.
Estimating Label Prototypes Given the current \{f_i\}_{i\in\Omega} and \{y_i\}_{i\in\Omega}, we calculate the label prototypes \{c_k\}_{k=1}^{K} as the weighted average of \{f_i\}_{i\in\Omega}:

\hat{c}_k = \frac{\sum_{i\in\Omega} y_{ik} f_i}{\sum_{i\in\Omega} y_{ik}}.

To avoid the misleading effect of mislabeled texts, inspired by (Liu et al., 2020), we update \{c_k\}_{k=1}^{K} with a moving average with learning rate \gamma:

c_k \leftarrow (1 - \gamma)c_k + \gamma\hat{c}_k.

Estimating Label Angle Variances Given \{f_i\}_{i\in\Omega} and \{c_k\}_{k=1}^{K}, the angles between them can be calculated by:

\theta_{ik} = \arccos\left(\frac{c_k^{\top} f_i}{\|c_k\|_2 \|f_i\|_2}\right). \quad (10)

Accordingly, we can compute the estimates of \{\mu_k\}_{k=1}^{K} and \{\sigma_k^2\}_{k=1}^{K} as follows:

\hat{\mu}_k = \frac{\sum_{i\in\Omega} y_{ik}\theta_{ik}}{\sum_{i\in\Omega} y_{ik}}, \quad \hat{\sigma}_k^2 = \frac{\sum_{i\in\Omega} y_{ik}(\theta_{ik} - \hat{\mu}_k)^2}{\sum_{i\in\Omega} y_{ik}}. \quad (11)

Further, the moving average is also used for these updates:

\mu_k \leftarrow (1 - \gamma)\mu_k + \gamma\hat{\mu}_k, \quad \sigma_k^2 \leftarrow (1 - \gamma)\sigma_k^2 + \gamma\hat{\sigma}_k^2. \quad (12)
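The estimation procedure above fits in one vectorized function. This is a sketch under our own naming, with the epoch-level batching simplified to a single call:

```python
import numpy as np

def estimate_stats(F, Y, C, mu, var, gamma=0.1):
    """One epoch-level update of label prototypes and angle statistics.

    F: (N, d) deep representations; Y: (N, K) label / pseudo-label weights;
    C: (K, d) prototypes; mu, var: (K,) angle means and variances.
    """
    w = Y.sum(axis=0)                    # per-label total weight
    C_hat = (Y.T @ F) / w[:, None]       # weighted-average prototype estimate
    C = (1 - gamma) * C + gamma * C_hat  # moving-average prototype update

    # Angles between every representation and every (updated) prototype.
    cos = (F @ C.T) / (np.linalg.norm(F, axis=1, keepdims=True)
                       * np.linalg.norm(C, axis=1))
    theta = np.arccos(np.clip(cos, -1.0, 1.0))

    mu_hat = (Y * theta).sum(axis=0) / w  # weighted angle means per label
    var_hat = (Y * (theta - mu_hat) ** 2).sum(axis=0) / w
    mu = (1 - gamma) * mu + gamma * mu_hat      # moving-average updates
    var = (1 - gamma) * var + gamma * var_hat
    return C, mu, var
```

Because Y holds soft pseudo-label weights for unlabeled texts and one-hot vectors for labeled ones, a single call covers both, which is how both kinds of texts contribute to the variance estimates during self-training.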

Experimental Settings
Datasets To conduct the experiments, we employ three widely used benchmark datasets for text classification: AG News (Zhang et al., 2015), Yelp (Zhang et al., 2015), and Yahoo (Chang et al., 2008). For all datasets, we form the unlabeled training set D u , labeled training set D l and development set by randomly drawing from the corresponding original training datasets, and utilize the original test sets for prediction evaluation. The dataset statistics and split information are described in Table 2.
Baseline Models To evaluate the effectiveness of S 2 TC-BDD, we choose five existing SSTC algorithms for comparison. The details of baseline methods are given below.
• BERT (Devlin et al., 2019): A supervised text classification method built on the pre-trained BERT-base-uncased model and fine-tuned with the supervised softmax loss on labeled texts.
• BERT+AM: A semi-supervised text classification method built on the pre-trained BERT-base-uncased model and fine-tuned, following the self-training spirit, with the AM loss on both labeled and unlabeled texts.
• UDA (Xie et al., 2020): Unsupervised Data Augmentation, whose code is publicly available. In our experiments, we utilize the default parameters, and generate the augmented unlabeled data using FairSeq with German as the intermediate language.
For S2TC-BDD, BERT, BERT+AM, VAT, and UDA, we utilize the BERT-base-uncased tokenizer to tokenize texts; average pooling over the BERT-base-uncased model as the text encoder; and a two-layer MLP, whose hidden size and activation function are 128 and tanh, respectively, as the classifier to predict labels. We set the maximum sentence length to 256 and retain the first 256 tokens of texts exceeding this limit. For optimization, we utilize the Adam optimizer with learning rates of 5e-6 for the BERT encoder and 1e-3 for the MLP classifier. For BERT, we set the batch size of labeled texts to 8. For S2TC-BDD, BERT+AM, VAT, and UDA, the batch sizes of labeled and unlabeled texts are 4 and 8, respectively. For all datasets, we iterate 20 epochs, each containing 200 inner loops. All experiments are carried out on a Linux server with two NVIDIA GeForce RTX 2080Ti GPUs, an Intel Xeon E5-2640 v4 CPU, and 64 GB of memory.
Parameter Settings For S2TC-BDD, in our experiments, the parameters are mostly set as: λ_1 =

Results
For all datasets, we perform each method with five random seeds, and report the average scores.

Varying Number of Labeled Texts
We first evaluate the classification performance of S2TC-BDD with different amounts of labeled texts. For all methods, we conduct experiments varying the number of labeled texts N_l over the set {100, 1000, 10000}, with the number of unlabeled texts N_u = 20000 for AG News and Yelp, and N_u = 40000 for Yahoo. The Micro-F1 and Macro-F1 classification results over all datasets are shown in Table 3, in which the best scores among all compared baselines are highlighted in boldface. Generally speaking, our proposed S2TC-BDD outperforms the baselines in most cases. Across all datasets and evaluation metrics, S2TC-BDD ranks 1.1 on average. Several observations are made below.
• Comparing S2TC-BDD against baselines: First, we observe that S2TC-BDD consistently dominates the pre-training methods (BERT and VAMPIRE) on both Micro-F1 and Macro-F1 scores by a large margin, especially when labeled texts are scarce. For example, when N_l = 100, the Macro-F1 scores of S2TC-BDD are about 0.17, 0.26, and 0.24 higher than those of VAMPIRE on AG News, Yelp, and Yahoo, respectively. Second, when labeled texts are very scarce (i.e., N_l = 100), S2TC-BDD performs better than the other self-training baselines (NB+EM, BERT+AM, VAT, and UDA) on all datasets, e.g., its Micro-F1 is about 0.08 higher than that of VAT on Yahoo. When more labeled texts are available, S2TC-BDD still achieves competitive, and often better, performance across all datasets.
• Comparing S2TC-BDD against BERT+AM and BERT: Our S2TC-BDD method consistently outperforms BERT+AM and BERT across all datasets and metrics. For example, when N_l = 100, the Micro-F1 scores of S2TC-BDD beat those of BERT+AM by 0.01 ∼ 0.03 and those of BERT by 0.03 ∼ 0.05 across all datasets. This result is expected: S2TC-BDD performs Gaussian linear transformations to balance the label angle variances, eliminating the margin bias, so it employs both labeled and unlabeled texts for training with more accurate pseudo-labels than BERT+AM, which benefits the classifier training. Besides, these results empirically show that unlabeled texts are beneficial to the classification performance.
• Comparing BERT-based methods against NB+EM and VAMPIRE: All BERT-based methods (BERT, BERT+AM, VAT, UDA, and S2TC-BDD) consistently dominate the baselines built on smaller models (NB+EM and VAMPIRE). For example, when N_l = 10000, the Micro-F1 and Macro-F1 scores of BERT are about 0.03, 0.18, and 0.05 higher than those of NB+EM on AG News, Yelp, and Yahoo, respectively. This observation is expected because BERT is a much larger model, and can hence extract more discriminative representations of texts than the VAE model used in VAMPIRE or the tf-idf features used in NB+EM.

Varying Number of Unlabeled Texts
For NB+EM, BERT+AM, VAMPIRE, VAT, UDA, and S2TC-BDD, we also perform experiments with 100 labeled texts, varying the number of unlabeled texts N_u over the set {0, 200, 2000, 20000} for AG News and Yelp, and {0, 400, 4000, 40000} for Yahoo. Note that VAMPIRE needs unlabeled texts for pre-training; thus we omit the experiments for VAMPIRE with N_u = 0. The classification results are reported in Table 4. Roughly, for all methods the classification performance improves as the amount of unlabeled texts increases. For instance, the Micro-F1 scores of S2TC-BDD on all datasets gain about 0.3 as the number of unlabeled texts increases. These results demonstrate the effectiveness of unlabeled texts in enriching the limited supervision from scarce labeled texts and improving the classification performance. Besides, an obvious observation is that the self-training methods (NB+EM, BERT+AM, VAT, UDA, and S2TC-BDD) consistently outperform the pre-training method (VAMPIRE), especially when unlabeled texts are fewer. A possible reason is that pre-training methods need more unlabeled texts for pre-training, while self-training methods do not have this requirement.

Ablation Study
We perform ablation studies by stripping one component at a time to examine the effectiveness of each component in S2TC-BDD. Here, BDD denotes the balanced deep representation angular loss L_bdd in Eq.4; stripping BDD means replacing the proposed loss L_bdd with the AM loss L_am in Eq.1. The results are displayed in Table 5. Overall, the classification performance drops when any component of S2TC-BDD is removed, suggesting that all parts contribute to the final performance. Besides, removing unlabeled texts brings the most significant performance drop. This result is expected because label angle variances approximated with only very scarce labeled texts are less accurate, resulting in worse performance. Further, in contrast to entropy regularization, the performance decreases more after stripping BDD. Note that the difference between the proposed L_bdd and L_am is whether the label angle variances are constrained to be balanced. This result indicates that the balanced constraint on label angle variances brings a better deep classifier as well as more accurate pseudo-labels for unlabeled texts, especially when labeled texts are limited, and also empirically proves the effectiveness of the proposed BDD loss.

Efficiency Comparison
To evaluate the efficiency of our S2TC-BDD, we perform efficiency comparisons over BERT, BERT+AM, and S2TC-BDD on all benchmark datasets. To be fair, for all methods and datasets we set the batch sizes of labeled and unlabeled texts to 4 and 8 respectively, and iterate 100 epochs, each consisting of 200 inner loops. The average per-epoch running times are shown in Table 6. Generally speaking, the per-epoch running time of our proposed S2TC-BDD is close to those of BERT and BERT+AM. This means that the Gaussian linear transformations and the estimation of label angle variances in S2TC-BDD introduce very little computational cost, which is expected since they merely require a few simple linear operations.

Conclusion
In this paper, we propose a novel self-training SSTC method, namely S2TC-BDD. Our S2TC-BDD addresses the margin bias problem in SSTC by balancing the label angle variances, i.e., the variances of the label angles of texts within the same label. We estimate the label angle variances with both labeled and unlabeled texts during the self-training loops. To constrain the label angle variances to be balanced, we design Gaussian linear transformations and incorporate them into the well-established AM loss. Our S2TC-BDD empirically outperforms the existing SSTC baseline methods.