JointMatch: A Unified Approach for Diverse and Collaborative Pseudo-Labeling to Semi-Supervised Text Classification

Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data. However, existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation. In this paper, we propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning and the task of learning with noise. JointMatch adaptively adjusts classwise thresholds based on the learning status of different classes to mitigate model bias towards current easy classes. Additionally, JointMatch alleviates error accumulation by utilizing two differently initialized networks to teach each other in a cross-labeling manner. To maintain divergence between the two networks for mutual learning, we introduce a strategy that weighs more disagreement data while also allowing the utilization of high-quality agreement data for training. Experimental results on benchmark datasets demonstrate the superior performance of JointMatch, achieving a significant 5.13% improvement on average. Notably, JointMatch delivers impressive results even in the extremely-scarce-label setting, obtaining 86% accuracy on AG News with only 5 labels per class. We make our code available at https://github.com/HenryPengZou/JointMatch.


Introduction
The success of deep learning models often heavily depends on the availability of large amounts of labeled data (He et al., 2016; Vaswani et al., 2017). However, labeled data for many tasks are often expensive, difficult, and time-consuming to obtain. By contrast, acquiring unlabeled data is more cost-effective and convenient in many scenarios. This has led to a surge of interest in semi-supervised learning, which aims to enhance learning performance with limited labeled samples by leveraging large amounts of unlabeled data (Berthelot et al., 2019; Sohn et al., 2020).
Recently, the combination of pseudo-labeling and consistency regularization has become a popular paradigm for semi-supervised learning (Sohn et al., 2020; Zhang et al., 2021; Sosea and Caragea, 2022). Pseudo-labeling (Lee et al., 2013) uses a fixed threshold to select the model's high-confidence predictions as pseudo-labels for further training, whereas consistency regularization (Sajjadi et al., 2016) enforces the model to make similar predictions for perturbed versions of the same data. For example, UDA (Xie et al., 2020) applies strong data augmentations, such as back-translation, to unlabeled data and minimizes the divergence between model predictions on the input and its augmented views. FixMatch (Sohn et al., 2020) uses the pseudo-label generated from weakly augmented unlabeled data to supervise the strongly augmented version of the same data. SAT (Chen et al., 2022) improves FixMatch by training a meta-learner to re-rank different augmentations based on their similarities with the original data. These methods require a pre-defined high-confidence threshold to generate high-quality pseudo-labels for good performance. However, there are some potential limitations: (1) setting a fixed threshold for pseudo-label selection neglects the varied difficulties of learning different classes (Zhang et al., 2021; Wang et al., 2023b,a), which can bias the model toward easy classes, as more pseudo-labels will be generated for currently easy classes (see Figure 1a); and (2) if these pseudo-labels are incorrect and used to train the model, the model can become worse and produce more inaccurate pseudo-labels, progressively accumulating its error and degrading its performance (see Figure 1c) (Arazo et al., 2020).
To address these issues, we propose JointMatch, a diverse and collaborative pseudo-labeling approach for semi-supervised text classification (SSTC). JointMatch is a holistic framework that unifies ideas from recent semi-supervised learning and the task of learning with noise for SSTC. Inspired by FlexMatch (Zhang et al., 2021) and FreeMatch (Wang et al., 2023b), we adaptively adjust classwise thresholds based on the estimated learning status of different classes at different times. This enables difficult classes at the current iteration to have lower local thresholds, thereby facilitating more pseudo-labels to be produced and used for learning these classes (see Figure 1b). Following the idea of co-training (Blum and Mitchell, 1998), JointMatch simultaneously trains two differently initialized networks and uses them to teach each other in a cross-labeling manner to alleviate error accumulation, as different networks can filter different types of noise.
Nevertheless, as training proceeds, the two networks slowly converge, reduce to one network, and thus again suffer from the issue of error accumulation. Inspired by Co-Teaching+ (Yu et al., 2019) and Decoupling (Malach and Shalev-Shwartz, 2017), we propose to give more weight to disagreement data to keep the two networks diverged, while also allowing the utilization of high-quality agreement data for training. The relationships and differences between JointMatch and related techniques are discussed in detail in Section 2.5.
We evaluate the proposed JointMatch on three commonly studied SSTC benchmark datasets. Experimental results indicate the superior performance of JointMatch, obtaining a significant 5.13% average improvement over the latest work SAT. We also analyze the performance of JointMatch with varying numbers of labeled data. The results show that JointMatch delivers impressive results even in the extremely-scarce-label setting, achieving 86% accuracy on AG News with only 5 labels per class. We provide comprehensive ablation studies and analysis to understand each part of JointMatch.

Overview
In this section, we introduce our proposed JointMatch for semi-supervised text classification. JointMatch is a unified approach that integrates ideas and components from recent semi-supervised learning and the task of learning with noise. Specifically, we utilize three key techniques, i.e., (i) adaptive local thresholding, (ii) cross-labeling, and (iii) weighted disagreement & agreement update, to address the limitations we have identified: (a) bias towards easy classes; (b) error accumulation; (c) the trade-off between divergence & consensus.
The main pipeline of JointMatch is shown in Figure 2. For each batch of unlabeled data U, we first apply both weak data augmentation α(•) and strong data augmentation A(•), such as synonym replacement and back-translation, respectively. The weakly augmented data are then forwarded to two differently initialized models f and g to make predictions. High-confidence predictions that pass the adaptive local threshold τ_t(c) are selected as pseudo-labels. Importantly, the pseudo-labels generated by one model are used to teach its peer network, i.e., cross-labeling. The unlabeled loss L_u is computed between the generated pseudo-labels and model predictions on strongly augmented data. Here, we weigh disagreement data (where the two networks make different label predictions) more to keep the two networks diverged, while also allowing the utilization of agreement data, which are more likely to be correct. A complete algorithm for JointMatch is presented in Algorithm 1. Next, we explain our key components in detail.

Adaptive Local Thresholding
Pseudo-labeling-based semi-supervised text classification algorithms use a fixed threshold to select high-confidence unlabeled data as pseudo-labels for training. However, this ignores the varied difficulties of learning different classes at different time steps: easier classes can generate more pseudo-labels than hard classes, causing increasing model bias along training (see Figure 1a). Inspired by FlexMatch (Zhang et al., 2021) and FreeMatch (Wang et al., 2023b), instead of using a fixed threshold, JointMatch estimates the learning status of different classes and adaptively adjusts the local thresholds for different classes to produce more diversified pseudo-labels (see Figure 1b). Specifically, at each time step t, we first estimate the classwise learning status $\tilde{p}_t = [\tilde{p}_t(1), \tilde{p}_t(2), \ldots, \tilde{p}_t(C)]$ via an exponential moving average (EMA) of the model's predicted probability on unlabeled data:

$$\tilde{p}_t = \lambda \tilde{p}_{t-1} + (1 - \lambda)\,\frac{1}{\mu B}\sum_{b=1}^{\mu B} q_b,$$

where $\tilde{p}_0 = [1/C, 1/C, \ldots, 1/C]$, C is the number of classes, B is the batch size of labeled data, µ is the ratio of unlabeled data to labeled data, λ ∈ (0, 1) is the momentum parameter, and $q_b = p_m(\alpha(u_b))$ denotes the model's predicted class distribution on weakly augmented unlabeled data. We then normalize the estimated learning status and adjust the local threshold for each class c from the pre-defined threshold τ:

$$\tau_t(c) = \frac{\tilde{p}_t(c)}{\max_{c'} \tilde{p}_t(c')}\,\tau.$$

By doing this, difficult classes at the current iteration have lower local thresholds, encouraging more pseudo-labels to be generated and utilized for training these classes.
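The EMA estimation and max-normalized thresholding above can be sketched as follows. This is an illustrative re-implementation, not the authors' released code; tensor shapes and function names are our own:

```python
import torch

def update_learning_status(p_tilde, weak_probs, lam=0.999):
    """EMA update of the classwise learning status.
    p_tilde:    (C,) current estimate (initialized to 1/C per class)
    weak_probs: (mu*B, C) predicted class distributions on the
                weakly augmented unlabeled batch."""
    return lam * p_tilde + (1 - lam) * weak_probs.mean(dim=0)

def local_thresholds(p_tilde, tau=0.95):
    """Max-normalize the learning status: the currently best-learned
    class keeps the base threshold tau, while harder classes get
    proportionally lower thresholds, admitting more of their
    pseudo-labels."""
    return p_tilde / p_tilde.max() * tau
```

Running a single update with momentum 0.5 on a toy two-sample batch shows the best-learned class retaining the full threshold while the others drop below it.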

Cross-Labeling
Another issue of pseudo-labeling, or more generally self-training, is error accumulation: if the generated predictions are incorrect and the model is trained on them, the model can become progressively worse, producing ever noisier pseudo-labels and accumulating its error (see Figure 1c). Inspired by Co-Training (Blum and Mitchell, 1998), JointMatch simultaneously trains two networks with different initializations. These networks are utilized to mutually instruct each other through cross-labeling. This strategy mitigates error accumulation because different networks can filter out different types of noise. More formally, given two differently initialized networks f and g, each network first selects its own high-confidence predictions on unlabeled data as pseudo-labels for the other network. The selected pseudo-labels are then used to compute the unlabeled data loss $\mathcal{L}_u$ that updates the parameters of the peer network:

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbb{1}\big(\max(q_b) \ge \tau_t(\arg\max q_b)\big)\, \mathcal{H}(\hat{q}_b, Q_b),$$

where B is the batch size of labeled data, µ is the ratio of unlabeled data to labeled data, $q_b$ and $Q_b$ denote the predicted class distributions on weakly and strongly augmented data, respectively, $\hat{q}_b$ is the one-hot label converted from $q_b$, and $\mathcal{H}$ refers to the cross-entropy loss. Compared to self-training, this approach displays a zigzag-shaped error flow, which helps avoid direct error accumulation within a single network.
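A minimal sketch of this confidence-masked loss for one direction of cross-labeling (peer network g produces the pseudo-labels, network f learns from them); the function and variable names are our own, and we assume softmax probabilities for the weak view and raw logits for the strong view:

```python
import torch
import torch.nn.functional as F

def cross_label_loss(weak_probs_g, strong_logits_f, taus):
    """Unlabeled loss for network f, supervised by its peer g.
    weak_probs_g:    (N, C) g's softmax outputs on weakly augmented data
    strong_logits_f: (N, C) f's logits on the strongly augmented views
    taus:            (C,)  adaptive local thresholds
    Only samples whose peer confidence exceeds the local threshold of
    their predicted class contribute to the averaged loss."""
    conf, hard = weak_probs_g.max(dim=-1)   # confidence and hard pseudo-label
    mask = (conf >= taus[hard]).float()     # per-sample selection mask
    ce = F.cross_entropy(strong_logits_f, hard, reduction="none")
    return (mask * ce).mean()
```

With a threshold of 0.8, a sample predicted at 0.9 confidence passes the mask while one at 0.6 is dropped, so only the first contributes to the mean.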

Weighted Disagreement & Agreement Update
Two differently initialized networks can have varied learning abilities and filter different types of error in the initial training stage, allowing them to learn from each other. However, as training goes on, the networks gradually converge, and the co-training approach degenerates to self-training, again suffering from error accumulation. To address this problem, Decoupling (Malach and Shalev-Shwartz, 2017) and Co-Teaching+ (Yu et al., 2019) propose to update networks only on data with disagreed network predictions and find that this strategy is effective in keeping the two networks diverged. Nevertheless, their approaches completely ignore agreement data, where both networks make the same prediction. We argue that agreement data are a valuable learning signal, as they are more likely to receive correct pseudo-labels, and should also be utilized. To this end, we compute a loss weight $w_b$ for each unlabeled sample $u_b$ that weighs disagreement data more to keep the two networks diverged while also allowing the utilization of agreement data:

$$w_b = \delta\,\mathbb{1}(\hat{q}_b^f \ne \hat{q}_b^g) + (1 - \delta)\,\mathbb{1}(\hat{q}_b^f = \hat{q}_b^g),$$

where δ ∈ (0.5, 1) is the disagreement weight and $\hat{q}_b^f$, $\hat{q}_b^g$ are the pseudo-labels predicted by networks f and g. The unlabeled data loss $\mathcal{L}_u$ thus becomes:

$$\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} w_b\, \mathbb{1}\big(\max(q_b) \ge \tau_t(\arg\max q_b)\big)\, \mathcal{H}(\hat{q}_b, Q_b).$$

Our experimental results in Section 4.2 show that this approach is more effective than using (i) only disagreement data, (ii) only agreement data, or (iii) weighting both kinds of data equally. Note that this method could be further improved by adaptively adjusting the disagreement weight based on the disagreement degree of the two networks or based on different time steps for curriculum learning. However, this topic is outside the focus and scope of this paper, and we leave it to interested researchers for future exploration. The overall objective for training one network in JointMatch is:

$$\mathcal{L} = \mathcal{L}_s + w_u \mathcal{L}_u,$$

where $w_u$ represents the loss weight for the unlabeled data loss $\mathcal{L}_u$, and the supervised loss $\mathcal{L}_s$ is given by:

$$\mathcal{L}_s = \frac{1}{B}\sum_{b=1}^{B} \mathcal{H}(y_b, p_m(x_b)).$$

Table 1: Comparison of closely related techniques with our JointMatch approach. In the first column, "Pseudo-Labeling": select high-confidence predictions as pseudo-labels for training; "Weak-Strong Augmentation": use the pseudo-labels generated from the weakly-augmented sample to supervise the strongly-augmented sample; "Disagreement Update": update networks only with samples that receive different pseudo-labels from the two networks.
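The per-sample weight $w_b$ for the weighted disagreement & agreement update can be sketched as follows (an illustrative helper; the names and the example value of δ are our own):

```python
import torch

def disagreement_weights(pseudo_f, pseudo_g, delta=0.7):
    """Per-sample loss weight w_b: delta (> 0.5) for samples where the
    two networks' hard pseudo-labels disagree, and 1 - delta for
    samples where they agree, so disagreement data are weighed more
    while agreement data still contribute."""
    disagree = (pseudo_f != pseudo_g).float()
    return delta * disagree + (1.0 - delta) * (1.0 - disagree)
```

For δ = 0.7, a batch where the networks agree, disagree, and agree again receives weights 0.3, 0.7, 0.3, which then multiply the per-sample unlabeled losses.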

Relation to Other Approaches
In this section, we compare our JointMatch algorithm with closely related approaches in Table 1.
Our goal is to identify the connections among them and point out the key techniques that make JointMatch effective in semi-supervised text classification. FixMatch (Sohn et al., 2020) and SAT (Chen et al., 2022) employ consistency regularization between weak and strong augmentations with pseudo-labeling to leverage unlabeled data. However, this approach uses a fixed threshold for pseudo-labeling and often suffers from model bias towards easy classes and error accumulation along training. FlexMatch (Zhang et al., 2021) and FreeMatch (Wang et al., 2023b) propose to adaptively adjust local thresholds based on the estimated learning status of each class, which alleviates the bias towards easy classes. To address the issue of error accumulation, Co-Training (Blum and Mitchell, 1998) trains two networks simultaneously and supervises each network with the pseudo-labels generated by its peer.
Decoupling (Malach and Shalev-Shwartz, 2017) and Co-Teaching+ (Yu et al., 2019) observe that updating on disagreement data keeps the two networks in Co-Training diverged and is thus more effective. However, when updating their networks, they completely ignore agreement data, which are also more likely to receive correct pseudo-labels in semi-supervised learning. Therefore, we propose utilizing both disagreement and agreement data for training, but weighting disagreement data higher to keep the two networks diverged while also exploiting the high-quality pseudo-labels from agreement data.
Our JointMatch framework unifies ideas from the approaches described above into a single, effective framework for semi-supervised text classification. The three key techniques in JointMatch are: (i) adaptive local thresholding; (ii) cross-labeling; and (iii) weighted disagreement & agreement update.
Implementation Details. Following (Chen et al., 2022, 2020), we use the BERT-base-uncased model as our backbone and the HuggingFace Transformers (Wolf et al., 2020) library for the implementation. For fair comparison, we adopt the same data augmentation techniques as all baselines, i.e., synonym replacement for weak augmentation and back-translation for strong augmentation. In detail, for back-translation, we translate texts into German and then back into English; for synonym replacement, we randomly substitute 30% of the words with WordNet synonyms. A complete list of our hyper-parameters is provided in Appendix A, and our code is released.
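The weak augmentation (replacing roughly 30% of words with synonyms) can be sketched as below. The paper draws synonyms from WordNet; here a hand-made synonym map stands in so the sketch stays self-contained, and all names are our own:

```python
import random

def synonym_replace(tokens, synonyms, ratio=0.3, rng=None):
    """Weak augmentation sketch: replace about `ratio` of the tokens
    with a synonym. `synonyms` maps a token to a list of alternatives;
    tokens without an entry are left unchanged."""
    rng = rng or random.Random(0)
    out = list(tokens)
    # positions of tokens that actually have at least one synonym
    candidates = [i for i, t in enumerate(tokens) if synonyms.get(t)]
    k = min(len(candidates), max(1, round(ratio * len(tokens))))
    if candidates:
        for i in rng.sample(candidates, k):
            out[i] = rng.choice(synonyms[out[i]])
    return out
```

For a six-token sentence, ratio 0.3 targets about two replacements, and tokens absent from the synonym map pass through untouched.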

Comparison with Baselines
We summarize the comparison with baselines on different text classification datasets in Table 3. The results are averaged over 5 runs with different model parameter initializations. We observe that JointMatch consistently demonstrates the best performance across all 3 datasets, surpassing the best baseline by 5.13% on average. On the most challenging Yahoo! Answers dataset, JointMatch also significantly outperforms the best baseline by 5.25% accuracy and 5.13% macro-F1. These substantial improvements indicate the effectiveness of our diverse and collaborative pseudo-labeling framework JointMatch. We further provide detailed ablation studies and analysis in the next section.

Varying the Number of Labeled Data
We also conduct experiments with varying numbers of labeled data to demonstrate the accuracy improvement of JointMatch over FixMatch on the Yahoo! Answers and AG News datasets. To ensure a fair comparison, we also include the performance of a vanilla ensemble of FixMatch, where we train two FixMatch models with different initializations and average their predicted probability distributions to obtain the ensemble prediction. As shown in Table 4 and Table 5, JointMatch outperforms the vanilla FixMatch ensemble with different numbers of labeled data on both datasets. This indicates that our improvement does not come simply from model ensembling.
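The vanilla ensemble baseline averages the two models' predicted probability distributions and predicts the class with the highest averaged probability; a minimal sketch (function name is ours):

```python
def ensemble_predict(probs_f, probs_g):
    """Vanilla two-model ensemble: average the predicted class
    distributions of models f and g for one sample, then return the
    averaged distribution and its argmax as the ensemble prediction."""
    avg = [(a + b) / 2.0 for a, b in zip(probs_f, probs_g)]
    return avg, max(range(len(avg)), key=avg.__getitem__)
```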
Notably, JointMatch offers remarkable improvements over this vanilla ensemble model, especially with extremely limited labeled data: 7% on Yahoo! Answers with 5 labels per class and a surprising 13.99% on AG News with 5 labels per class. This further validates the effectiveness of JointMatch in utilizing unlabeled data and demonstrates its capability and potential for real-world scenarios.

Generalizability Results
To show the generalizability of JointMatch, we add experiments on three datasets with a larger number of classes (c = 32, 27, 8) and include one more recently published and competitive semi-supervised learning method, SoftMatch (Chen et al., 2023), for comparison. JointMatch consistently delivers improvements over the other methods, indicating that our approach also works well in settings with a larger number of classes. Table 7 provides the statistics and split information of the added datasets. We provide short descriptions of these datasets in Appendix C.

Effectiveness of Each Component
To show the effectiveness of each component in JointMatch, we measure the accuracy of JointMatch after removing different parts in Table 8. We observe that performance decreases after stripping each component, suggesting that all components in JointMatch contribute to the final performance. The performance drops most significantly after removing cross-labeling on both datasets, justifying the benefits of the two models teaching each other and their role in alleviating error accumulation. Adaptive local thresholding plays a similarly vital role in the final performance of JointMatch, indicating the effectiveness of adjusting local thresholds based on the current classwise learning status.
Removing disagreement weights, i.e., the weighted disagreement and agreement update, also causes a performance drop, although a smaller one compared to the other two components. We hypothesize that this is because the current fixed disagreement weights are not optimal; they should be adaptively adjusted based on the degree of disagreement between the two networks at different times. This topic is outside the focus and scope of this paper, and we leave it for future exploration. Despite this, we show the benefit of disagreement weights and provide an ablation study on their values in the next section.

Table 9: Accuracy on AG News with 10 labels per class and various disagreement weights. Updating networks with only agreement samples (δ = 0) or only disagreement samples (δ = 1) performs worse than updating with both types of data. Weighting disagreement samples more delivers further gains.

Influence of Disagreement Weights
We now analyze the benefits of the weighted disagreement & agreement update. Table 9 shows how our approach performs with different disagreement weights δ on AG News with 10 labels per class. Note that (1) δ = 0 means updating networks with only agreement samples, δ = 1 means updating with only disagreement samples, and δ = 0.5 means updating with both types of samples weighted equally; (2) all samples used for updating the networks must first be high-confidence samples. We observe that updating with either only agreement samples or only disagreement samples performs worse than updating with both kinds of samples. This validates our assumption that both agreement and disagreement samples are beneficial for training: agreement samples provide high-quality pseudo-labels, while disagreement samples offer sample diversity and keep the two networks diverse. In addition, weighting disagreement samples more obtains better results than equal weighting, which emphasizes the importance of keeping the two networks diverged so that they can learn from each other.

Quantity & Quality of Pseudo Labels
JointMatch improves both the quality and quantity of pseudo-labels used for training. As shown in Figure 3a, FixMatch gradually leverages more pseudo-labels as training advances, ultimately utilizing about 80% of the available pseudo-labels. However, the cost is that the accuracy of the pseudo-labels keeps dropping, eventually falling below 40%, as indicated in Figure 3b. This is consistent with our motivation in Section 2.3 that directly using noisy pseudo-labels causes models to accumulate errors and gradually produce even noisier pseudo-labels. On the contrary, JointMatch maintains a high accuracy of pseudo-labels throughout the entire training process (Figure 3b), which validates the effectiveness of cross-labeling in preventing error accumulation. Furthermore, JointMatch generates a greater number of pseudo-labels than FixMatch during the early stages of training (Figure 3a) without sacrificing their quality.

Related Work
Semi-Supervised Text Classification. Semi-supervised learning has attracted a lot of attention in the field of text classification due to its ability to leverage large amounts of unlabeled data with limited labels. UDA (Xie et al., 2020) introduces strong data augmentation and proposes a consistency loss to minimize the distance between predicted distributions of differently perturbed data. MixText (Chen et al., 2020) leverages Mixup (Zhang et al., 2018) to interpolate unlabeled and labeled data in hidden space to avoid overfitting the limited labeled data. S2TC-BDD (Li et al., 2021) notices the margin bias issue and addresses it by balancing label angle variances. PCM (Xu et al., 2022) exploits the inherent semantic matching capability inside pre-trained language models to benefit SSTC. AUM-ST (Sosea and Caragea, 2022) builds on self-training and uses Area Under the Margin (Pleiss et al., 2020) to filter possibly inaccurate pseudo-labels. Hosseini and Caragea (2023) combine two models in a co-training fashion, one trained via unsupervised domain adaptation from a source to a target domain and the other via semi-supervised learning in the target domain, where the two models iteratively teach each other by interchanging their high-confidence predictions. CrisisMatch (Zou et al., 2023) studies several popular semi-supervised components and integrates Mixup with pseudo-labeling (Lee et al., 2013) for disaster tweet classification. Building upon the idea of FixMatch (Sohn et al., 2020) of using pseudo-labels generated from weakly-augmented data to teach strongly-augmented data, the recent work SAT (Chen et al., 2022) proposes to re-rank different weak and strong augmentations according to their similarity with the original input. However, these works ignore the varied difficulties of learning different classes at different time steps. Inspired by the recent FlexMatch (Zhang et al., 2021) and FreeMatch (Wang et al., 2023b), our JointMatch introduces adaptive thresholding to SSTC to dynamically adjust classwise thresholds based on the learning status of each class.
Learning with Noise. The task of learning with noise aims to train robust networks against label noise. Co-Teaching (Han et al., 2018) proposes to simultaneously train two networks that mutually select small-loss instances to teach each other, thus avoiding direct error accumulation within one network. Nevertheless, as training goes on, the two networks may converge and reduce to one identical network, and the problem of error accumulation resurfaces. Decoupling (Malach and Shalev-Shwartz, 2017) proposes to update models only on data receiving different predictions from the two networks. Co-Teaching+ (Yu et al., 2019) observes that this strategy can keep the two networks diverged and further boost the performance of Co-Teaching. While these approaches were originally designed for fully supervised settings, our JointMatch integrates their ideas into semi-supervised text classification. We provide a comprehensive comparison between closely related approaches and JointMatch in Table 1.

Conclusion
In this paper, we propose JointMatch, a semi-supervised text classification approach that integrates ideas from recent semi-supervised learning and learning with noise. Our method is motivated by observed problems in semi-supervised text classification (SSTC): model bias towards easy classes and error accumulation. JointMatch utilizes adaptive thresholding, cross-labeling, and weighted disagreement & agreement updates to address these issues effectively. Through extensive experiments on three standard SSTC benchmark datasets, we found that JointMatch significantly outperforms previous works on all datasets and demonstrates surprising improvements over FixMatch in the extremely-scarce-label setting. We also show that JointMatch generalizes well to datasets with a large number of classes.

Limitations
There are several limitations to our work. First, our method trains two networks simultaneously, resulting in a higher computational cost and longer training time. For example, each experiment on AG News takes approximately 2.5 hours on an A5000 GPU for JointMatch, while FixMatch only requires 1.25 hours. However, the additional computational cost is not a significant issue in low-resource settings, as experiments in such settings typically do not take much time. Second, our weighted disagreement & agreement update could be further improved by adaptively adjusting the disagreement weight based on the degree of mutual agreement between the networks and their confidence at different times. This will be explored in the future.

A Hyperparameters
A complete list of JointMatch hyperparameters on all evaluated datasets is provided in Table 10.

B Additional Ablation Study Results
This section provides additional ablation study results on several hyperparameters as a complement to the main paper. All experimental results here are obtained from JointMatch on AG News with 10 labels per class.

B.1 EMA Decay
Table 11 shows the results of JointMatch with different EMA decay λ for estimating the classwise learning status. Note that λ = 1 means the estimated learning status for all classes always equals the initial value 1/C, i.e., the local thresholds are never adjusted (C is the number of classes). We observe that adjusting local thresholds based on the estimated learning status (λ ∈ [0, 1)) is significantly better than not adjusting them (λ = 1).

B.2 Fixed Confidence Threshold
In Table 12, we show the performance of JointMatch with varying fixed confidence thresholds τ. It can be seen that setting a threshold higher than 0.75 is generally beneficial for good performance, but setting it too high leads to some performance drop, as fewer unlabeled data can be utilized.

B.3 Unlabeled Data Ratio in MiniBatch
In Table 13, we present the results of JointMatch with different ratios µ of unlabeled data to labeled data. One can observe that using a large amount of unlabeled data helps increase performance. A very high value of µ is not encouraged, since it slows down training while giving only marginal improvements.

C Short Descriptions of Added Datasets
We provide a short description of these datasets here: (1) GoEmotions (Demszky et al., 2020) is a dataset of Reddit comments labeled with 27 emotions, such as amusement, fear, and gratitude. (2) Empathetic Dialogues (Rashkin et al., 2019) consists of conversations between a speaker and a listener, labeled with 32 fine-grained emotions. (3) Hurricane is a crisis tweet dataset sampled from HumAID (Alam et al., 2021). It contains human-labeled tweets collected during hurricane disasters and includes 8 crisis-related classes, such as infrastructure and utility damage, and displaced people and evacuations.

Figure 1: Number of pseudo-labels (a, b) and validation accuracy (c) of FixMatch and JointMatch on AG News with 10 labels per category. Pseudo-labeling approaches that use a fixed threshold, such as FixMatch, suffer from pseudo-label bias towards easy classes and error accumulation along training. In contrast, JointMatch effectively balances pseudo-labels and avoids the degenerated performance caused by error accumulation.

Figure 2: Pipeline of JointMatch. Unlabeled data undergo both weak and strong data augmentation. Weakly augmented data are first fed into two differently initialized models. Predictions with confidence surpassing the adaptive local thresholds are selected as pseudo-labels for training. Notably, pseudo-labels from one model are used to teach its peer network, i.e., cross-labeling. The unlabeled data loss is then computed between pseudo-labels from weakly augmented data and model predictions on strongly augmented data. Here, we weigh disagreement data more to keep the two models diverged for effective mutual learning.

Figure 3: The quantity and quality of pseudo-labels on AG News with 10 labels per class. Ratio refers to the fraction of pseudo-labeled data that participate in the current training iteration. JointMatch improves the quantity of pseudo-labels in the early training stage, while maintaining a high quality of pseudo-labels throughout the whole training process.

Table 2: Dataset statistics and split information. The numbers of unlabeled, validation, and test data in the table are per class. For each dataset, we use the original test set and randomly sample from the training set to construct our unlabeled training set and validation set. We report the mean and standard deviation over 5 runs.

Table 3: Performance (accuracy (%) and macro-F1 (%)) comparison with baselines on different text classification datasets. JointMatch delivers the best results on all benchmark datasets, surpassing the latest SSTC work SAT (Chen et al., 2022) by 5.13% on average. The best results are shown in blue. c: number of classes.

Following (Chen et al., 2022), we randomly sample N_c samples per class as labeled data for training. N_c is set to 10 for both the AG News and IMDB datasets, and 20 for the Yahoo! dataset.

Table 4: Accuracy on Yahoo! Answers with varying numbers of labeled data.

Table 7: Statistics and split information of the added datasets. All numbers shown here are totals over all classes.
Table 6 summarizes the accuracy results in the 10-shot setting.

Table 10: Complete list of JointMatch hyperparameters on AG News, Yahoo! Answers, and IMDB.

Table 11: Ablation study on the EMA decay.

Table 12: Ablation study on the fixed threshold.

Table 13: Ablation study on the ratio µ of unlabeled data to labeled data. A large amount of unlabeled data helps performance, while a very high µ slows down training for only marginal gains.