Leveraging Training Dynamics and Self-Training for Text Classification


A popular category of semi-supervised learning (SSL) methods in text classification is self-training (McLachlan, 1975; Xie et al., 2020b; Rasmus et al., 2015; Scudder, 1965; Mukherjee and Hassan Awadallah, 2020), an iterative approach that uses a trained teacher model to produce pseudo-labels for unlabeled examples, uses these labels to train a student model, and then repeats the process with the student as the new teacher until a convergence criterion is met. The quality of the pseudo-labels in the teacher-student framework is an important factor in the self-training process. In supervised learning, noisy labels are problematic and can negatively impact generalization performance (Zhang et al., 2017a), especially for deep neural networks, which can attain zero training error on any dataset (Zhang et al., 2016). This phenomenon applies to self-training as well, since the student model's predictions are optimized towards potentially noisy pseudo-labels. To address this drawback, popular self-training methods (Xie et al., 2020b) mask out examples that the teacher model is not confident about. However, relying only on the teacher's confidence can be problematic, especially if the teacher model is not well calibrated (Guo et al., 2017) or has poor performance.
In this work, we investigate the impact of pseudo-label quality on the performance of self-training methods in text classification and show that designing more sophisticated quality assurance measures for the teacher pseudo-labels leads to an improvement in generalization performance. We hence propose a novel self-training method that leverages training dynamics to assess the adequacy of the teacher pseudo-labels. In a nutshell, instead of using only the teacher's current beliefs about an unlabeled example (i.e., the confidence) to decide whether the example should be masked, our method also analyzes how pseudo-labeled examples behave during training. Specifically, we leverage Area Under the Margin (AUM) (Pleiss et al., 2020) from supervised learning, which captures the divergence between the annotated label and the predicted label during training. AUM is calculated as the average difference between the logit corresponding to the gold annotated label and the largest other logit. Prior work has shown that a low AUM score correlates well with an example being mislabeled.
Our approach, which we call AUM-ST, extends AUM to unlabeled data and provides a more robust and effective mechanism for identifying noisy pseudo-labels than current approaches based only on confidence. For each unlabeled example, AUM-ST computes the average difference during training between the logit of the teacher pseudo-label and the largest other logit. A low AUM for an unlabeled example indicates a constant tension between its assigned (potentially incorrect) pseudo-label and the hidden true class. Our method therefore masks pseudo-labeled examples with low AUM. Critically, unlike the vanilla AUM, where the annotated labels are constant, in AUM-ST the pseudo-labels are variable and depend on the teacher network. As the self-training process progresses, the teacher generates higher-quality pseudo-labels in each iteration, which are then used to further improve the student. In a way, AUM-ST can be viewed as a method to enforce a strict learning curriculum (Gong et al., 2016; Kervadec et al., 2019; Yu et al., 2020), where challenging unlabeled examples are not used until the teacher network is able to produce adequate pseudo-labels for them.
To show the effectiveness of our approach, we test it on a diverse range of text classification tasks, ranging from emotion detection and sentiment analysis to grammaticality and question classification. Notably, AUM-ST is extremely effective on all benchmarks in low-resource settings, obtaining an average improvement in accuracy over a baseline BERT (Devlin et al., 2019) model of 3.5% using 200 examples per class and an 8.3% improvement using as few as 20 examples per class.

Related Work
We first discuss related work on semi-supervised learning for text classification. Second, we zoom in on self-training, a type of SSL that is the core of our AUM-ST. Finally, we discuss approaches for learning with label noise.

Semi-supervised Learning in NLP. Semi-supervised learning has attracted much attention in the NLP community (Gururangan et al., 2019b; Yang et al., 2015; Clark et al., 2018; Chen et al., 2020b; Yang et al., 2017; Xie et al., 2020a; Mukherjee and Awadallah, 2020b), since unlabeled data is often much easier to acquire than labeled data. For example, Miyato et al. (2016) applied adversarial perturbations to text in the word embedding space. Yang et al. (2019) used a hierarchy structure to propagate supervision from high-level labels to lower-level labels, while Clark et al. (2018) introduced cross-view training, where a model makes auxiliary predictions while seeing only parts of the input text and is trained to match the predictions made when given the entire input. Xie et al. (2020a) used data augmentations on unlabeled examples and trained the model to output the same predictions when fed clean or augmented versions of the same input. Mukherjee and Awadallah (2020b) introduced uncertainty estimates into self-training, a particular type of SSL where a teacher and a student model are iteratively trained using labeled and unlabeled data. Self-training is the core of our AUM-ST, hence we detail it further in the next paragraph.

Self-Training. Our AUM-ST approach builds upon previous work on self-training (Miyato et al., 2018; Sajjadi et al., 2016b; Laine and Aila, 2017; Tarvainen and Valpola, 2017; Berthelot et al., 2019b,a; Xie et al., 2020a; Lee et al., 2013; Sajjadi et al., 2016a; Rosenberg et al., 2005; Verma et al., 2021; Miyato et al., 2016; Chen et al., 2020a; Gururangan et al., 2019a; Zhang et al., 2017b; Izmailov et al., 2020; Sachan et al., 2019), but replaces the traditional confidence-thresholding data filtering mechanism with a more effective approach that takes into account the training dynamics of unlabeled examples.

Self-training is an SSL method where a single model is repeatedly trained on both labeled and unlabeled data until a convergence criterion is met. The model selects which unlabeled data to train on using its own predictions. Concretely, traditional self-training follows these steps:
1) Train a teacher model M on a labeled set L.
2) Use M to make predictions and obtain pseudo-labels on a set of unlabeled examples U.
3) Optionally, filter out unlabeled examples using a criterion. For example, in traditional self-training, unlabeled examples where the teacher confidence is not high enough are ignored.
4) Use both the labeled set L and the generated pseudo-labeled set to train a new student model M′.
5) Continue to step 2 with the student as the new teacher (i.e., M ← M′).

Learning with Label Noise. Several approaches to achieve robustness to label noise have been proposed. For example, Goldberger and Ben-Reuven (2016) proposed adding a noise layer to the neural network architecture, whose parameters can be learned to accurately estimate the correct labels. Saxena et al. (2019) introduced a curriculum-learning approach that uses learnable data parameters and ranks the importance of examples in the learning process. These parameters are then leveraged to decide which data to use at different training stages. Liu and Guo (2020), on the other hand, proposed altering the loss function to make it more robust in the face of label noise. To this end, they introduced Peer Loss Functions, which evaluate predictions on both the samples at hand and carefully, automatically constructed peer samples. Other approaches designed techniques to accurately identify and eliminate potentially mislabeled instances (Brodley and Friedl, 1999; Pleiss et al., 2020). Our work builds on the latter approaches; we leverage Area Under the Margin (Bartlett et al., 2017; Pleiss et al., 2020; Elsayed et al., 2018; Jiang et al., 2018) and adapt it to our self-training setup. We emphasize that most of these methods for learning with label noise can be adapted to our setting, and these are potential future directions for our work.
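The five-step traditional self-training loop described earlier in this section can be sketched as follows. This is a minimal illustration, not the setup used in our experiments: the nearest-centroid classifier over one-dimensional features is a hypothetical stand-in for a neural model, and the names `train_centroids`, `predict_proba`, and `self_train`, the 0.7 threshold default, and the stopping rule (pseudo-labels no longer change) are assumptions made for the example.

```python
import math

def train_centroids(examples):
    """Toy 'model': one centroid per class over 1-D features (stand-in for a neural classifier)."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(model, x):
    """Softmax over negative distances to each class centroid."""
    scores = {y: -abs(x - c) for y, c in model.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def self_train(labeled, unlabeled, threshold=0.7, max_iters=5):
    teacher = train_centroids(labeled)                     # step 1: train the teacher
    pseudo = []
    for _ in range(max_iters):
        new_pseudo = []
        for x in unlabeled:                                # step 2: pseudo-label U
            probs = predict_proba(teacher, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:                  # step 3: confidence filter
                new_pseudo.append((x, label))
        student = train_centroids(labeled + new_pseudo)    # step 4: train the student
        converged = new_pseudo == pseudo                   # assumed convergence criterion
        teacher, pseudo = student, new_pseudo              # step 5: student becomes teacher
        if converged:
            break
    return teacher, pseudo
```

In this toy setting, an unlabeled point that lies halfway between the two class centroids never passes the confidence filter, which mirrors how low-confidence examples are ignored in step 3.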

Our method
In this section, we first provide background information on the vanilla AUM (Pleiss et al., 2020) metric. Next, we introduce our proposed AUM-ST and detail the various procedures that we used to improve its performance.

Background
We start by introducing Area Under the Margin (AUM) (Pleiss et al., 2020), a metric from supervised learning based on training dynamics that can characterize training examples with respect to their contribution to generalization. AUM is defined as the margin averaged across all T training epochs. Specifically, at an arbitrary epoch t ∈ {1, ..., T}, the margin is:

$$M^{t}(x, y) = z_{y} - \max_{i \neq y} z_{i},$$

where $M^{t}(x, y)$ is the margin of example x with true label y, $z_{y}$ is the logit corresponding to the true label y, and $\max_{i \neq y} z_{i}$ is the largest other logit, corresponding to a label i not equal to y. Intuitively, the margin measures how different a true label is compared to a model's belief at each epoch t. Therefore, the AUM of x is defined as:

$$\mathrm{AUM}(x, y) = \frac{1}{T} \sum_{t=1}^{T} M^{t}(x, y).$$

In Algorithm 1, k is the size of $U_{AUM}$. In Step 6, we use the student as the new teacher and reiterate the process from Step 2. We also explored reusing the teacher (i.e., using a student initialized with the teacher network weights) to estimate the AUMs of unlabeled examples in Step 3 of our algorithm. However, we noticed a slight decrease in accuracy of 0.4%.
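As a small worked illustration of the margin and AUM definitions, the sketch below computes both from logits recorded at each epoch. The function names `margin` and `area_under_margin` and the list-of-lists input layout (one logit vector per epoch) are assumptions made for this example; the experiments themselves rely on the AUM package of Pleiss et al. (2020).

```python
def margin(logits, label):
    """Margin at one epoch: logit of the assigned label minus the largest other logit."""
    largest_other = max(z for i, z in enumerate(logits) if i != label)
    return logits[label] - largest_other

def area_under_margin(logits_per_epoch, label):
    """AUM: the margin averaged over all recorded epochs."""
    margins = [margin(logits, label) for logits in logits_per_epoch]
    return sum(margins) / len(margins)

# Two epochs of logits for a single three-class example.
history = [[2.0, 1.0, 0.0], [3.0, 0.5, 0.0]]
aum_if_label_0 = area_under_margin(history, 0)  # margins 1.0 and 2.5, mean 1.75
aum_if_label_1 = area_under_margin(history, 1)  # margins -1.0 and -2.5, mean -1.75
```

A label that the model consistently prefers yields a positive AUM, while a label that is consistently dominated by another class yields a negative AUM, which is the signal used to flag likely mislabeled examples.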
The main improvement of AUM-ST lies in the use of training dynamics to assess the quality of the pseudo-labels of unlabeled examples. Based on this quality measure, AUM-ST successfully filters harmful pseudo-label noise and improves model performance. Using a strongly noised student (Step 5) is another important factor in our framework. We train our student to match its predictions on strongly augmented examples Π(x_i) to the teacher's predictions on weakly augmented examples π(x_i). The intuition behind this design choice comes from recent work on semi-supervised learning in vision (Sohn et al., 2020; Zhang et al., 2021), which has shown that this combination of weak and strong augmentations works extremely well in practice. Specifically, using weak augmentations to generate the pseudo-labels and computing the loss against strong augmentations enforces a type of consistency regularization that, in our setup, exposes the student to a more difficult environment, which leads to the student outperforming the teacher.
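The weak-to-strong consistency objective on unlabeled data can be sketched as a cross-entropy between the teacher's hard pseudo-label on π(x_i) and the student's prediction on Π(x_i). This is a minimal illustration under assumptions: the function names and the plain-list logit inputs are hypothetical, and the supervised term on labeled examples is omitted.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    log_total = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - log_total for z in logits]

def student_unlabeled_loss(teacher_logits_weak, student_logits_strong):
    """Mean cross-entropy between teacher hard pseudo-labels (from weak views)
    and student predictions on the strongly augmented views."""
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits_weak, student_logits_strong):
        pseudo = max(range(len(t_logits)), key=t_logits.__getitem__)  # hard pseudo-label
        total -= log_softmax(s_logits)[pseudo]
    return total / len(teacher_logits_weak)
```

The loss is small when the student's prediction on the strongly augmented view agrees with the teacher's pseudo-label, and grows as the two diverge.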
We use various approaches to obtain strongly augmented data. In our setup, weak augmentations are created using synonym replacement (Kolomiyets et al., 2011) or SwitchOut (Wang et al., 2018), and strong augmentations are obtained by randomly applying backtranslation with long chain lengths (> 5), SwitchOut, and synonym replacement. We further discuss the impact of different types of augmentations in §5.4.

Other Factors. AUM-ST works better in practice with several additional design choices: 1) Consistent with other state-of-the-art SSL frameworks (Sohn et al., 2020; Xie et al., 2020a,b), in Step 2 of the algorithm, we select unlabeled examples only if the teacher confidence is higher than a threshold value (i.e., 0.7 in AUM-ST).
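To make the token-level augmentations concrete, the sketch below gives simplified versions of synonym replacement and a SwitchOut-style perturbation. Everything here is an assumption for illustration: the toy `SYNONYMS` lexicon, the function names, and the per-token replacement probability `p`. In particular, `switchout_like` is a simplification and does not reproduce the sampling scheme of Wang et al. (2018).

```python
import random

# Toy synonym lexicon (hypothetical); a real system would use a proper thesaurus.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"]}

def synonym_replace(tokens, p=0.3, seed=0):
    """Replace a token that has known synonyms with one of them, with probability p."""
    rng = random.Random(seed)
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def switchout_like(tokens, vocabulary, p=0.1, seed=0):
    """Simplified SwitchOut-style noise: swap a token for a random vocabulary word with probability p."""
    rng = random.Random(seed)
    return [rng.choice(vocabulary) if rng.random() < p else t for t in tokens]
```

Both functions preserve the sequence length, so a weak view π(x) and a strong view Π(x) of the same sentence can be produced by varying `p` or by composing the two perturbations.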

Experiments and Results
In this section, we first introduce the eight benchmark text classification datasets used to evaluate AUM-ST (§4.1). Second, we introduce the weak and strong baselines (§4.2) that we compare against our AUM-ST. Next, we detail our experimental setup (§4.3) and conclude by presenting the results obtained on all datasets in low data regimes (§4.4).

Datasets
We consider various text classification datasets to benchmark our AUM-ST self-training approach. We first experiment with the Stanford Sentiment Treebank (SST) (Socher et al., 2013). SST contains 11,855 sentences from movie reviews, annotated with five sentiment labels: negative, somewhat negative, neutral, somewhat positive, and positive. First, we consider the binarized version of the SST dataset, called SST-2, where the examples with the negative and somewhat negative labels are merged into a negative class, and the examples with the somewhat positive and positive labels are merged into a positive class (with the neutral class being removed). Second, we consider the fine-grained version SST-5, which uses all five labels. Next, we consider the IMDB (Maas et al., 2011) movie reviews dataset. While the SST dataset is annotated at sentence level, an important particularity of the IMDB dataset is that it is annotated at review level and contains significantly longer text sequences.
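The SST-2 binarization described above amounts to a label mapping that drops the neutral class. A minimal sketch, assuming examples are stored as (text, label) pairs with string labels (the storage format and the name `binarize_sst` are assumptions for illustration):

```python
def binarize_sst(examples):
    """Map SST-5 style (text, label) pairs to SST-2, dropping the neutral class."""
    to_binary = {
        "negative": "negative",
        "somewhat negative": "negative",
        "somewhat positive": "positive",
        "positive": "positive",
    }
    return [(text, to_binary[label]) for text, label in examples if label in to_binary]
```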
GoEmotions (Demszky et al., 2020) is a sentence-level multi-label dataset created using Reddit comments. Containing more than 58,000 sentences annotated with 27 emotions and the neutral class, GoEmotions provides a great opportunity to study the expression of fine-grained emotions and to develop emotion classification models. We experiment both with the highly granular version of the dataset (27 emotions and the neutral class, denoted by GoEmotions-28 in our experiments) and with the version of the dataset where the labels are clustered into the Ekman basic set of six emotions, namely anger, disgust, fear, joy, sadness, and surprise, denoted by GoEmotions-Ek. CancerEmo (Sosea and Caragea, 2020a) is a dataset annotated at sentence level with the eight basic Plutchik-8 (Plutchik, 1980) emotions. The data is collected from a cancer forum of an Online Health Community and contains 8,500 total examples annotated with fine-grained emotions and 16,500 sentences that express no emotions (the neutral class).
We also consider the task of question classification and experiment with TREC-6, a dataset of 5,452 examples where fact-based questions are divided into six broad semantic categories. Finally, we test on the Corpus of Linguistic Acceptability (CoLA), a dataset composed of 10,657 sentences from 23 linguistics publications, manually annotated by expert linguists for acceptability (i.e., grammaticality).

Baselines
Weak Baselines. In this section, we present two weak self-training baselines, where we experiment with various approaches for selecting which unlabeled data to use during self-training (i.e., Step 4 in our AUM-ST). Our first approach, entitled RAND, chooses the unlabeled set of examples to use during training at random. The second method considered is CONF, which selects unlabeled examples only if the model confidence passes a pre-defined threshold.
Strong Baselines. First, we experiment with Uncertainty-aware Self-training (UST) (Mukherjee and Awadallah, 2020a) as a strong baseline. UST incorporates uncertainty estimates into the standard self-training framework through a few highly effective changes. UST computes uncertainty estimates for all unlabeled examples by stochastically passing the examples through the model multiple times, with dropout enabled before each layer. The approach subsequently uses these uncertainty estimates to select which unlabeled data to use. Concretely, the model not only favors unlabeled data where the teacher model is confident, but also enforces low entropy of the teacher predictions. Second, we experiment with UDA (Xie et al., 2020a). UDA leverages backtranslation (Edunov et al., 2018) and uses a consistency loss to enforce that the model predictions on unlabeled data are invariant to input noise.
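The idea of deriving uncertainty from several stochastic forward passes can be illustrated with the sketch below. It is a simplified stand-in and not UST's exact selection rule: it takes the class probabilities from each dropout-enabled pass as given and computes their mean and the entropy of that mean. The function names are assumptions for this example.

```python
import math

def mean_prediction(stochastic_probs):
    """Average class probabilities over several stochastic (dropout-enabled) forward passes."""
    n_passes = len(stochastic_probs)
    n_classes = len(stochastic_probs[0])
    return [sum(p[c] for p in stochastic_probs) / n_passes for c in range(n_classes)]

def predictive_entropy(probs):
    """Entropy of a probability vector; higher values mean a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Two stochastic passes for one example with two classes.
passes = [[0.9, 0.1], [0.7, 0.3]]
avg = mean_prediction(passes)          # approximately [0.8, 0.2]
uncertainty = predictive_entropy(avg)  # about 0.50 nats, below the maximum of ln 2
```

An example selection rule built on top of this would prefer unlabeled examples with a confident mean prediction and low entropy.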

Experimental Setup
We evaluate the performance of our AUM-ST by varying the number of training examples on the eight text classification benchmark datasets presented above. On each dataset, we experiment with 20, 50, 100, and 200 examples per class, which we sample without replacement. The remaining examples are used as unlabeled data. We follow the exact evaluation metrics used in the works introducing the datasets: accuracy for SST-2, SST-5, IMDB, and TREC-6, macro F1 for GoEmotions and CancerEmo, and Matthews correlation for CoLA. In each setup, we run our models five times with different parameter initializations and report the average results as well as their standard deviations. Model-wise, all our experiments use BERT (Devlin et al., 2019) base uncased as the backbone model and the HuggingFace Transformers (Wolf et al., 2020) library for the implementation. We use the AUM package provided by Pleiss et al. (2020) for the AUM estimation of unlabeled examples. We use the translation models provided by Tiedemann and Thottingal (2020) for backtranslation. We present the hyperparameters of our model in Appendix A.
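The low-resource split and the reporting over repeated runs can be sketched as follows. The helper names, the (text, label) storage format, and the use of the sample standard deviation are assumptions for illustration; the text above does not specify which standard deviation estimator is reported.

```python
import random
import statistics

def split_low_resource(dataset, per_class, seed=0):
    """Sample `per_class` labeled examples per class without replacement;
    the remaining examples are returned as unlabeled texts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, (_, label) in enumerate(dataset):
        by_class.setdefault(label, []).append(idx)
    labeled_idx = set()
    for indices in by_class.values():
        labeled_idx.update(rng.sample(indices, per_class))
    labeled = [dataset[i] for i in sorted(labeled_idx)]
    unlabeled = [dataset[i][0] for i in range(len(dataset)) if i not in labeled_idx]
    return labeled, unlabeled

def summarize_runs(scores):
    """Mean and sample standard deviation over repeated runs (e.g., five initializations)."""
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, calling `split_low_resource(dataset, per_class=20)` on a two-class dataset yields 40 labeled examples, with every other example contributing only its text to the unlabeled pool.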

Results
We show the results obtained across the eight datasets in Table 1. Overall, we note that our approach is extremely effective, significantly outperforming strong baselines on all datasets. For example, AUM-ST pushes the accuracy over the strongest UDA baseline by 1.6% on SST-2 and 1.1% on SST-5 using 50 labels per class. Critically, using 100 examples per class on SST-2, AUM-ST obtains 93.1% accuracy, a considerable improvement of 9.6% over the baseline BERT model. Remarkably, on IMDB, we improve the accuracy over UDA by 4.3% with 20 examples per class and over the fully supervised BERT by 12.6%. We see consistent improvements on both the GoEmotions-28 and GoEmotions-Ek datasets as well, where our method is particularly effective using 100 and 200 examples per class. Notably, our AUM-ST improves upon the baseline BERT model by 10% in F1 score using 200 examples per class on GoEmotions-28, and pushes the F1 score over UDA by 2.3% using the same number of examples on the GoEmotions-Ek dataset.
We notice that on the TREC-6 question classification dataset, the UDA model slightly outperforms our AUM-ST using 200 examples per class, but lags behind in the other setups. For example, AUM-ST pushes the accuracy by 0.7% using 20 examples per class. Results on CoLA also showcase the effectiveness of our method, where we see an improvement of 1.4% in Matthews correlation over the strong UDA baseline and 6.8% over the fully supervised approach using 20 examples per class.
These results indicate that AUM-ST performs extremely well in low data regimes and can be used effectively when training data is scarce. We emphasize that AUM-ST can considerably reduce the annotation effort needed to obtain good performance on the task at hand.

Ablation Study
While our augmentation techniques play an important part in AUM-ST, we argue that AUM filtering is a vital component of our framework. We therefore perform an ablation study to verify that the improvements in performance do not come solely from the consistency loss over weak and strong augmentations. Specifically, we retrain our strongest baseline, UDA (Xie et al., 2020a), in all data regimes (20/50/100/200 labels per class) using the same augmentations as AUM-ST on four datasets: IMDB, GoEmotions, TREC-6, and CoLA. Concretely, this variation of UDA (denoted by UDA-AUG) minimizes the KL divergence between the predictions on weakly augmented unlabeled examples and the predictions on strongly augmented unlabeled examples. We show the results in Table 2, where we observe that AUM-ST obtains steady improvements in performance over UDA of 1.2% on average. Critically, these results show that the improvements in performance come from our AUM-based filtering method, emphasizing its effectiveness.
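The consistency term minimized by UDA-AUG can be sketched as a KL divergence between the two predicted distributions. The direction KL(weak || strong), with the weak-view prediction acting as the target, is an assumption of this sketch, as are the function names and the small `eps` smoothing constant.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(weak_probs, strong_probs):
    """Mean KL(p_weak || p_strong) over a batch of unlabeled examples."""
    pairs = list(zip(weak_probs, strong_probs))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```

The loss is zero when the predictions on the weak and strong views coincide and grows as the strong-view prediction drifts away from the weak-view prediction.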

AUM-ST when large labeled data is available
We evaluate AUM-ST in high-resource settings to verify if it performs well on large datasets.To this end, we first seek to collect additional unlabeled data to use alongside the provided training sets.
Unlabeled Data. We collect in-domain unlabeled data (if not provided) for six out of the eight datasets presented previously:
SST-2 and SST-5: We use the Kaggle Rotten Tomatoes corpus, which contains more than one million reviews. We employ the same preprocessing techniques as in the original paper (Socher et al., 2013) (e.g., splitting at sentence level).
IMDB: The IMDB dataset (Maas et al., 2011) already contains an unlabeled set of examples provided by the authors, hence we use it in our experiments.
CancerEmo: We use the same discussion boards from the Cancer Survivors Network used in the work introducing the dataset (Sosea and Caragea, 2020b).
GoEmotions-28 and GoEmotions-Ek: Since the authors do not disclose the subreddits used for sampling their data, we resort to using a general Reddit dump (Henderson et al., 2019).
We omit TREC-6 and CoLA in this experiment since additional unlabeled data for grammaticality or question classification is hard to find.
To enable reproducibility and spur further research into SSL techniques for text classification, we will make the collected unlabeled data available to the research community.

Results
We show the results obtained in high-resource settings in Table 3. First, we observe that on SST-2, SST-5, and IMDB our weak baselines with unlabeled data do not bring any improvements over the fully supervised approach. Second, interestingly, while UST outperforms the base model on the SST-2 and IMDB datasets, its performance on the fine-grained SST-5 is extremely low. Finally, our AUM-ST is successful on all the datasets, improving upon the supervised model by 1.4% on average and outperforming all the strong baselines.
On GoEmotions-28 and GoEmotions-Ek (Demszky et al., 2020), our weak baselines, RAND and CONF, marginally outperform the baseline BERT, improving the average F1 by 0.1% and 0.3%, respectively. Interestingly, UST performs poorly on this dataset, being outperformed by the trivial CONF approach. We note that UDA performs the best among the baselines. However, our AUM-ST consistently outperforms other methods and yields a 1.5% improvement in F1 score over the supervised model. On CancerEmo (Sosea and Caragea, 2020a), we note that UST performs much better, with an improvement in F1 of 1.2% over the supervised classifier. Our AUM-ST approach is still the most successful, with a considerable 3.1% improvement over the baseline BERT and a 0.7% improvement over the strong UDA model.

Table 3: Comparison of different self-training methods using the entire training set and additional unlabeled data. We report the results in terms of accuracy on SST-2, SST-5, IMDB and macro F1 on GoEmotions-28, GoEmotions-Ek, and CancerEmo.
While AUM-ST is particularly effective in low resource settings (as shown in §4.4), the results here also showcase the feasibility of our approach, which consistently outperforms all the other methods both in low-resource settings and high-resource (large labeled data) settings.

Unlabeled Data Impurity
As mentioned previously, noisy pseudo-labels can be detrimental to learning effective SSL models. In this section, we analyze the unlabeled data impurity (i.e., the fraction of unlabeled data that is incorrectly classified by our model) to compare the pseudo-label quality of various SSL methods against our AUM-ST. We emphasize that an effective SSL approach should aim to minimize impurity; low impurity indicates that the pseudo-labels of the teacher network are of high quality. We perform this analysis on the GoEmotions-28 dataset in a low data regime, using 200 examples per class. In this setup, since the unlabeled data is created from the original (labeled) training set, we can easily compute the unlabeled error rate. We show the impurity at the end of each self-training iteration in Figure 1. Notably, at the end of the training process, AUM-ST lowers the impurity by 1.4% relative to the UDA model and by 3% relative to the UST method. Interestingly, we observe that the methods perform on par with each other until Iteration 10, when the impurity of AUM-ST becomes lower than that of the other methods.
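Impurity as defined above is simply the error rate of the pseudo-labels against the held-back gold labels. A minimal sketch, assuming the two label lists are aligned by example (the function name and input format are assumptions for illustration):

```python
def impurity(pseudo_labels, gold_labels):
    """Fraction of pseudo-labeled examples whose pseudo-label disagrees with the held-back gold label."""
    if len(pseudo_labels) != len(gold_labels):
        raise ValueError("pseudo_labels and gold_labels must be aligned")
    if not pseudo_labels:
        return 0.0
    wrong = sum(p != g for p, g in zip(pseudo_labels, gold_labels))
    return wrong / len(pseudo_labels)
```

Tracking this value after every self-training iteration gives the kind of curve shown in Figure 1.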

The Impact of Weak and Strong Augmentations
In this section, we analyze how our model performs when trained under various weak augmentations (π) and strong augmentations (Π) in our self-training framework. We show in Table 4 the performance in terms of macro F1 of AUM-ST using various combinations of π and Π on the GoEmotions-28 dataset with 200 examples per class. Note that we experiment with every combination of π and Π (even combinations where π is a stronger augmentation than Π) in order to also analyze the behavior of our approach when using stronger augmentations to generate the teacher pseudo-labels. We consider in our analysis the following augmentation strategies: no augmentation (NoAug), synonym replacement (SynRepl) (Kolomiyets et al., 2011), SwitchOut (Wang et al., 2018), and BT-n, which denotes backtranslation (Edunov et al., 2018) with a chain of length n through languages such as German, French, and Italian. We also consider a combination of these augmentation strategies (e.g., backtranslation, synonym replacement, and SwitchOut). Interestingly, we can see from the table that using SwitchOut as π and a combination of backtranslation with large n, synonym replacement, and SwitchOut as Π yields the best results, improving upon the fully supervised model by as much as 10% in F1. We can also see from the table that, as π goes from weak augmentations to strong augmentations, the performance degrades compared to using low-noise weak augmentations. These results emphasize the importance of both weak and strong data augmentations in our AUM-ST, indicating that they are a vital component of our framework.
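The BT-n augmentation can be sketched as a chain of translation hops that ends back in the source language. This sketch rests on assumptions: it interprets the chain length n as the number of pivot languages, the `translate(text, src, tgt)` callable is a hypothetical interface to be supplied by the user, and `stub_translate` is a placeholder that only records the hops so the example runs without a translation model.

```python
def backtranslate(text, pivot_languages, translate, source_lang="en"):
    """Translate through each pivot language in turn, then back to the source language."""
    current_lang = source_lang
    for lang in pivot_languages:
        text = translate(text, current_lang, lang)
        current_lang = lang
    return translate(text, current_lang, source_lang)

def stub_translate(text, src, tgt):
    """Placeholder translator that only records the hop; swap in a real MT model."""
    return f"{text}|{src}>{tgt}"

# A chain through German and French, ending back in English.
augmented = backtranslate("hi", ["de", "fr"], stub_translate)
```

Longer chains accumulate more paraphrasing noise, which is why long chains serve as a strong augmentation Π rather than a weak augmentation π.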

Computational Costs
In this section, we discuss the computational cost of our AUM-ST and how it compares with other methods. First, we note that, to perform the AUM estimation (Step 3 of our algorithm), AUM-ST trains an additional model compared to other teacher-student approaches such as CONF or UST. However, the computational costs incurred by this additional training step are not a serious issue in low-resource settings. Even in setups with large amounts of both labeled and unlabeled data, our computational cost is not significantly higher than that of the other methods, because AUM-ST converges in fewer steps even though it includes an additional training stage. Concretely, AUM-ST is 15% more computationally expensive than traditional pseudo-labeling (i.e., the CONF method). Moreover, AUM-ST converges three times faster than UST (Mukherjee and Awadallah, 2020b) and twice as fast as UDA (Xie et al., 2020a).

Conclusion
We improve the traditional self-training framework through a novel Area Under the Margin unlabeled example selection technique, and show that our approach is effective in a wide range of text classification tasks. We studied our approach in various domains (social networks, forums, online platforms) and contexts (movie reviews, medical forum discussions, fact-based questions), and observed that our AUM-ST outperforms other strong self-training approaches. In the future, we plan to incorporate other approaches for learning under label noise into SSL frameworks such as self-training.

Limitations
This work shows that achieving good performance in text classification with limited labeled data is possible. Unfortunately, this is possible only if there is easy access to unlabeled data. Moreover, while unlabeled data for some tasks is hard to obtain (as we found for the TREC and CoLA datasets), we also emphasize that even when unlabeled data is available, its distribution can be mismatched with the labeled data distribution, which was shown to be particularly challenging to deal with in SSL (Coates et al., 2011). Our work does not study this scenario; however, we aim to further explore our method in this setting.
Pleiss et al. (2020) show that examples with low AUMs are ambiguous or tend to be mislabeled, and that removing these examples can help generalization performance. The vanilla AUM procedure can be summarized as follows: 1) Train a classifier and monitor the AUM of each training example; 2) Examples from the training set which have an AUM smaller than a threshold are considered mislabeled and are completely eliminated from the training set; and 3) Train a new classifier on the filtered training set.

Proposed Approach
AUM-ST is a novel SSL approach that leverages the training dynamics of unlabeled examples to improve a model's performance. Algorithm 1 gives an overview of our AUM-ST. We first train a teacher model on weakly augmented labeled examples (Step 1) and use the trained teacher to make predictions and generate hard pseudo-labels for weakly augmented unlabeled examples (Step 2). Next, we monitor the training dynamics of these unlabeled data and their pseudo-labels (Step 3). Specifically, we characterize the unlabeled examples according to their contribution to model learning and generalization using AUM. Next, we filter out data with low AUM, since these examples are likely to hurt generalization performance (Step 4). Then, we train a student model to be consistent with the teacher's predictions on unlabeled examples. Concretely, we train our student to minimize the combined cross-entropy on weakly augmented labeled examples and strongly augmented, high-AUM unlabeled examples (Step 5). Finally, we use the student as the new teacher and reiterate the process from Step 2 (Step 6).

Algorithm 1 Proposed AUM-ST
Require: Labeled data $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, unlabeled data $U = \{x_1, x_2, \ldots, x_m\}$, and AUM threshold $\gamma$.
1: Learn teacher model $\theta_t$ on weakly noised labeled data by minimizing the cross-entropy loss $\mathcal{L}_{\theta_t} = \frac{1}{n} \sum_{i=1}^{n} H(y_i, p(y \mid \pi(x_i); \theta_t))$.
2: Use the weakly noised teacher model to generate hard pseudo-labels for weakly augmented unlabeled examples: $\hat{y}_i = \arg\max(p(y \mid \pi(x_i); \theta_t)), \ \forall i = 1, \ldots, m$.
3: Train model $\theta_{AUM}$ on weakly augmented training and unlabeled examples, and monitor the training dynamics of unlabeled examples over T epochs: $\mathrm{AUM}(x_i, \hat{y}_i) = \frac{1}{T} \sum_{t=1}^{T} \left[ z_{\hat{y}_i} - \max_{j \neq \hat{y}_i} z_j \right]$, where $z_{\hat{y}_i}$ and $z_j$ are the logits corresponding to the pseudo-label $\hat{y}_i$ and the largest other logit produced by $\theta_{AUM}$.
4: Rank and select high-AUM unlabeled examples: $U_{AUM} = \{(x_i, \hat{y}_i) \in U \mid \mathrm{AUM}(x_i, \hat{y}_i) > \gamma\}$.
5: Train a student model $\theta_s$ which minimizes the cross-entropy loss on weakly augmented labeled examples and strongly augmented high-AUM unlabeled examples.
6: Use the student as a teacher and go back to Step 2.
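Steps 2 to 4 of Algorithm 1 can be sketched in a few lines once the relevant logits are available. This is a minimal sketch under assumptions: teacher logits and the per-epoch logits of $\theta_{AUM}$ are passed in as plain lists rather than produced by a trained model, the function names are hypothetical, and the 0.7 confidence threshold follows the value mentioned under Other Factors.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_pseudo_labels(teacher_logits, min_confidence=0.7):
    """Step 2: argmax pseudo-labels, kept only if the teacher confidence exceeds the threshold."""
    selected = {}
    for i, logits in enumerate(teacher_logits):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] > min_confidence:
            selected[i] = label
    return selected

def aum(logit_history, label):
    """Step 3: average over epochs of (pseudo-label logit minus largest other logit)."""
    margins = [z[label] - max(v for j, v in enumerate(z) if j != label)
               for z in logit_history]
    return sum(margins) / len(margins)

def select_high_aum(pseudo_labels, histories, gamma):
    """Step 4: keep (index, pseudo-label) pairs whose AUM exceeds the threshold gamma."""
    return {i: y for i, y in pseudo_labels.items() if aum(histories[i], y) > gamma}
```

Here `histories[i]` holds one logit vector per training epoch for unlabeled example `i`, so an example whose pseudo-label is repeatedly out-scored by another class ends up with a negative AUM and is filtered out for any non-negative `gamma`.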
2) In Step 3, we train θ_AUM only on the subset of unlabeled examples that pass the filtering from the previous step. While considering all the unlabeled examples might work when labeled data is abundant, training a model with very few labeled examples and many potentially noisy pseudo-labeled examples produces poor AUM estimations. 3) We always balance the class distribution of the unlabeled examples. 4) When training our model on both labeled and unlabeled examples (Step 5), our batches contain both labeled and unlabeled examples. The ratio of labeled to unlabeled examples is constant across all batches and set to 1:7 (i.e., each batch contains seven times as many unlabeled examples as labeled examples).
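The class balancing and the fixed 1:7 batch composition can be sketched as below. Several details are assumptions made for illustration and are not specified in the text: balancing is done by downsampling every class to the size of the rarest one, the small labeled set is cycled so that it can be reused across batches, and leftover unlabeled examples that do not fill a complete batch are dropped.

```python
import random
from itertools import cycle, islice

def balance_classes(pseudo_labeled, seed=0):
    """Downsample every pseudo-label class to the size of the rarest class (assumed strategy)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in pseudo_labeled:
        by_class.setdefault(y, []).append((x, y))
    smallest = min(len(items) for items in by_class.values())
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, smallest))
    return balanced

def mixed_batches(labeled, pseudo_labeled, labeled_per_batch=2, ratio=7):
    """Yield (labeled, unlabeled) batches with `ratio` times as many unlabeled examples."""
    unlabeled_per_batch = labeled_per_batch * ratio
    labeled_stream = cycle(labeled)  # the small labeled set is reused across batches
    last_start = len(pseudo_labeled) - unlabeled_per_batch
    for start in range(0, last_start + 1, unlabeled_per_batch):
        yield (list(islice(labeled_stream, labeled_per_batch)),
               pseudo_labeled[start:start + unlabeled_per_batch])
```

With `labeled_per_batch=2`, each batch holds 2 labeled and 14 pseudo-labeled examples, matching the 1:7 ratio.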

Figure 1: Comparison of impurity between the UST, UDA, and our AUM-ST model on the GoEmotions-28 dataset with 200 examples per class.
model of 3.5% using 200 examples per class and 8.3% improvement using as few as 20 examples per class.

Table 2: Ablation study of our AUM-ST.
already contains an unlabeled set of examples provided by the authors, hence we use it in our experiments.CancerEmo: We use the same discussion boards from the Cancer Survivors Network used in the work introducing the dataset

Table 4: Performance using various weak augmentations π and strong augmentations Π on the GoEmotions-28 dataset using 200 examples per class.