Rethinking Semi-supervised Learning with Language Models

Semi-supervised learning (SSL) is a popular setting aiming to effectively utilize unlabelled data to improve model performance in downstream natural language processing (NLP) tasks. Currently, there are two popular approaches to make use of unlabelled data: Self-training (ST) and Task-adaptive pre-training (TAPT). ST uses a teacher model to assign pseudo-labels to the unlabelled data, while TAPT continues pre-training on the unlabelled data before fine-tuning. To the best of our knowledge, the effectiveness of TAPT in SSL tasks has not been systematically studied, and no previous work has directly compared TAPT and ST in terms of their ability to utilize the pool of unlabelled data. In this paper, we provide an extensive empirical study comparing five state-of-the-art ST approaches and TAPT across various NLP tasks and data sizes, including in- and out-of-domain settings. Surprisingly, we find that TAPT is a strong and more robust SSL learner, even when using just a few hundred unlabelled samples or in the presence of domain shifts, compared to more sophisticated ST approaches, and tends to bring greater improvements in SSL than in fully-supervised settings. Our further analysis demonstrates the risks of using ST approaches when the size of labelled or unlabelled data is small or when domain shifts exist. We offer a fresh perspective for future SSL research, suggesting the use of unsupervised pre-training objectives over dependency on pseudo labels.


Introduction
Pre-training (PT) language models (LMs) (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019) over large amounts of text data (e.g. with masked language modelling) and then fine-tuning them on task-specific labelled data offers large performance gains across NLP tasks. Semi-supervised learning (SSL) (Grandvalet and Bengio, 2004; Chapelle et al., 2009; Kipf and Welling, 2017) is a powerful and effective approach to utilize unlabelled data. A typical SSL setting assumes access to a (relatively small) labelled training set and an (often large) unlabelled set. The goal of SSL is to make effective use of the unlabelled data to improve the performance of the model (i.e. the LM).
In NLP, Self-training (ST) approaches have been proposed to produce pseudo-labels for unlabelled examples to train the model (e.g. in Yarowsky, 1995; McClosky et al., 2006). With the advent of neural networks, ST approaches typically focus on using student-teacher models to assign pseudo-labels to the unlabelled data (e.g. in Artetxe et al., 2018; Cai and Lapata, 2019; Dong and de Melo, 2019; Xie et al., 2020a; Gera et al., 2022). Apart from the sophisticated ST approaches, Gururangan et al. (2020) proposed task adaptive pre-training (TAPT), which is a straightforward yet effective method for utilising unlabelled examples. This method involves continuing pre-training the LM on the task-specific data without using labels, before proceeding with fully-supervised fine-tuning. TAPT and ST are both motivated by the need to leverage unlabelled examples effectively, raising the questions of how TAPT performs in SSL tasks, as well as how these two approaches perform against each other.
In this work, we investigate the performance of TAPT against five state-of-the-art ST approaches across five NLP tasks (§4). We empirically show that TAPT outperforms all state-of-the-art ST approaches on several tasks, suggesting that it should serve as a strong baseline for SSL methods. Previous research (Gururangan et al., 2020) has shown that TAPT can improve performance in fully-supervised settings. Our study goes further by showing that TAPT can be even more effective in SSL settings (§4).
We next study the impact of using different amounts of labelled and unlabelled data for SSL (§5). Our experiments show that ST approaches are prone to suffering from insufficient labelled or unlabelled data, while TAPT is more robust across different combinations of labelled and unlabelled data sizes. Contrary to the common assumption that TAPT requires a large amount of data to perform well (e.g. Li et al., 2021b; Hou et al., 2022), our results show that TAPT improves performance with just a hundred unlabelled samples. We conduct further analysis on the impact of domain shifts in labelled or unlabelled data. While ST approaches generally suffer from domain shifts, TAPT is more robust and even benefits from domain shifts (§6).
In summary, the main contributions of this paper are as follows:
• An extensive empirical study that directly compares five state-of-the-art ST approaches and TAPT across various NLP tasks in SSL, with varying amounts of labelled and unlabelled data, and that also examines the effect of domain shifts;
• Practical insights into the limitations of ST approaches, alongside an exploration of the often-unrecognized yet impressive capacity of TAPT as a simple, stable and powerful SSL learner;
• A fresh perspective for future SSL research, demonstrating that leveraging unsupervised signals from unlabelled texts is a promising and effective alternative to depending on pseudo-labels.

Preliminaries
Task Adaptive Pre-training (TAPT)
LMs are adapted to downstream NLP tasks by fine-tuning (FT) on task-specific data. TAPT introduces a simple additional step before fine-tuning by continuing pre-training with a masked language modelling (MLM) objective (Devlin et al., 2019; Liu et al., 2019) on the task-specific data without requiring labels. The main advantage of TAPT is that it provides a simple way for the LM to explore the task space while it can easily make use of all available labelled and unlabelled data.
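To make the MLM objective concrete, the following is a minimal sketch of how one training example might be corrupted before the model is asked to recover the original tokens. It assumes the masking scheme of Devlin et al. (2019): 15% of positions are selected, of which 80% are replaced by a mask symbol, 10% by a random token and 10% are left unchanged. The function name `mask_tokens`, the whitespace tokenization and the tiny vocabulary are illustrative assumptions, not the configuration used in our experiments.

```python
import random

MASK_TOKEN = "<mask>"  # illustrative mask symbol (assumption)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for one MLM training example.

    targets[i] holds the original token at positions the model must predict
    and None at positions that do not contribute to the MLM loss.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK_TOKEN)          # 80%: mask symbol
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))   # 10%: random token
            else:
                corrupted.append(tok)                 # 10%: keep original
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


if __name__ == "__main__":
    text = "the care centre stretched credulity well past the limits".split()
    print(mask_tokens(text, vocab=["movie", "plot", "actor"], seed=1))
```

No labels appear anywhere in this procedure, which is why both the labelled and the unlabelled task texts can be used for TAPT.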

Self-training (ST)
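The core idea behind ST is to use a teacher model trained on labelled examples to make predictions for unlabelled examples, and then to train a student model with these predictions. Let L = {(x_i, y_i)}, i = 1, ..., n, denote the n labelled examples and U = {x̃_i}, i = 1, ..., m, denote the m unlabelled examples. The ST framework is trained with three main steps as follows.
Step 1. A teacher model F, parameterized by a neural network Θ, is trained via minimizing the cross entropy loss ℓ on labelled examples L:
$$\mathcal{L}_{\text{teacher}}(L) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, F(x_i; \Theta)\big) \tag{1}$$
Step 2. The teacher model F is used to make predictions (referred to as "pseudo-labels") on unlabelled examples U:
$$\tilde{y}_i = F(\tilde{x}_i; \Theta), \quad \tilde{x}_i \in U \tag{2}$$
where ỹ_i can be either the continuous logit or the discrete label induced by an ARGMAX operation.
Step 3. A student model G, parameterized by a fresh neural network Φ, is trained to fit labelled and pseudo-labelled examples:
$$\mathcal{L}_{\text{student}}(L, U) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, G(x_i; \Phi)\big) + \frac{1}{m} \sum_{i=1}^{m} \ell\big(\tilde{y}_i, G(\tilde{x}_i; \Phi)\big) \tag{3}$$
This process is repeated for a given number of times by treating the student as a new teacher to re-predict pseudo-labels as in eq. (2) and then training a new student with eq. (3). In practice, ST with techniques such as consistency regularization (Miyato et al., 2018; Clark et al., 2018; Berthelot et al., 2019b), strong data augmentation (Sohn et al., 2020; Xie et al., 2020b,a), and confidence thresholding (Sohn et al., 2020; Zhang et al., 2021; Berthelot et al., 2022) usually leads to substantial improvements in model performance.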
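The three steps above can be sketched as a short loop. This is a toy illustration under strong assumptions: the "model" is a one-dimensional nearest-centroid classifier rather than a neural network, pseudo-labels are discrete (the ARGMAX variant), and labelled and pseudo-labelled examples are simply pooled with equal weight per example. The names `fit_centroids`, `predict` and `self_train` are hypothetical.

```python
def fit_centroids(xs, ys):
    """Train a toy classifier: one mean value (centroid) per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labelled_x, labelled_y, unlabelled_x, rounds=3):
    # Step 1: train the teacher on labelled examples only.
    teacher = fit_centroids(labelled_x, labelled_y)
    for _ in range(rounds):
        # Step 2: the teacher assigns discrete pseudo-labels.
        pseudo_y = [predict(teacher, x) for x in unlabelled_x]
        # Step 3: a fresh student fits labelled + pseudo-labelled examples.
        student = fit_centroids(labelled_x + unlabelled_x,
                                labelled_y + pseudo_y)
        # The student becomes the teacher for the next round.
        teacher = student
    return teacher


if __name__ == "__main__":
    model = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 8.0, 9.0])
    print(model)
```

Because each student is trained on the previous teacher's predictions, any wrong pseudo-label is fed back into training, which is the mechanism behind the confirmation bias discussed later.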

Experimental Setup
Datasets. We experiment with five datasets used in previous related work for SSL (Gururangan et al., 2019; Chen et al., 2020b; Xie et al., 2020a; Li et al., 2021a; Gera et al., 2022), including IMDB (Maas et al., 2011), SST-2 (Wang et al., 2018), AG NEWS (Zhang et al., 2015), AMAZON REVIEW (McAuley and Leskovec, 2013), and YAHOO! ANSWER (Chang et al., 2008). Table 1 shows data statistics.

Table 1: Dataset statistics.
Dataset | Task type | Train size | Dev. size | Test size | Classes | L (avg. input length)
SST-2 (Wang et al., 2018) | Movie Review Sentiment | 60,000 | 7,349 | 872 | 2 | 37
AG NEWS (Zhang et al., 2015) | News Topic Classification | 100,000 | 10,000 | 7,600 | 4 | 134
AMAZON REVIEW (McAuley and Leskovec, 2013) | Product Review Sentiment | 250,000 | 25,000 | 650,000 | 5 | 79
YAHOO! ANSWER (Chang et al., 2008) | Topic Classification | 500,000 | 50,000 | 60,000 | 10 | 32

We also provide descriptions and examples of datasets in Appendix §A.1. We show the process for quantifying the similarity between datasets in Appendix §A.2. Adhering to previous work (e.g. Chen et al., 2020b; Wang et al., 2022), we sample the same amount of labelled data per class from the train set, given the labelled size, to form the labelled set. We re-sample the labelled data using the same five seeds for all approaches and report the average performance with an error bar.
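The class-balanced, seed-controlled sampling described above could look roughly as follows. This is a sketch, not our released code: the function name `sample_labelled_set` is hypothetical, and treating all remaining training examples as the unlabelled pool is an assumption made for illustration.

```python
import random
from collections import defaultdict


def sample_labelled_set(examples, labelled_size, seed):
    """Draw the same number of labelled examples per class.

    examples: list of (text, label) pairs from the train set.
    Returns (labelled_pairs, unlabelled_texts).
    """
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(examples):
        by_class[label].append(idx)

    num_classes = len(by_class)
    if labelled_size % num_classes != 0:
        raise ValueError("labelled_size must be divisible by the number of classes")
    per_class = labelled_size // num_classes

    rng = random.Random(seed)  # same seed -> same labelled set for every method
    chosen = set()
    for label in sorted(by_class):
        chosen.update(rng.sample(by_class[label], per_class))

    labelled = [examples[i] for i in sorted(chosen)]
    # Assumption: everything not chosen is treated as unlabelled text.
    unlabelled = [examples[i][0] for i in range(len(examples)) if i not in chosen]
    return labelled, unlabelled


if __name__ == "__main__":
    data = [(f"review {i}", i % 2) for i in range(20)]
    for seed in range(1, 6):  # five seeds, as in our setup
        labelled, unlabelled = sample_labelled_set(data, labelled_size=4, seed=seed)
        print(seed, labelled)
```

Fixing the seed per run means every approach sees exactly the same labelled examples, so differences in performance cannot be attributed to a lucky draw.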
TAPT. Our approach to task adaptive pre-training (TAPT) using ROBERTA-BASE (Liu et al., 2019) is to further pre-train on the training text corpus including labelled and unlabelled data (see Table 12 in Appendix for hyperparameter details).
The model is then fine-tuned on the labelled data where the [CLS] token representation is passed to an extra feed-forward layer for classification (see Table 13 in Appendix for hyperparameter details).
The process of TAPT + FINE-TUNING is simply denoted by TAPT henceforth.
Baselines. For reference, we also evaluate two baseline models that are only fine-tuned (from an off-the-shelf ROBERTA-BASE checkpoint) on: (1) the same labelled set as TAPT and ST (SUPERVISED); and (2) the whole training set (FULLY-SUPERVISED).

ST vs TAPT
Overview. Table 2 shows the performance of TAPT against five state-of-the-art ST approaches and the baselines (SUPERVISED and FULLY-SUPERVISED) across five datasets, each with two different sizes of labelled data for training following Wang et al. (2022). Overall, we observe that: (1) TAPT achieves highly competitive results compared with state-of-the-art ST approaches; and (2) TAPT gains more improvement compared to the SUPERVISED baselines when using fewer labelled samples.
For our first finding, the experimental results show that TAPT outperforms all five state-of-the-art ST approaches with lower variances on AMAZON REVIEW and YAHOO! ANSWER, as shown in Table 2. For example, on YAHOO! ANSWER, TAPT obtains an F1 score of 68.8% compared to the best ST approach's F1 score of 68.0% (using 500 labelled samples), and 71.5% compared to ST's 69.6% (using 2000 labelled samples). For an example of the second finding, on IMDB, TAPT gains a 3.6% F1 improvement over SUPERVISED when using 20 labelled samples, compared to 2.2% when using 100 labelled samples. Below we discuss these two findings in more detail.
#1. TAPT is a strong semi-supervised learner and can outperform state-of-the-art ST approaches. Figure 1 shows how the performance of ST, TAPT, and SUPERVISED varies with respect to five different labelled sizes on each dataset, where the two latest ST approaches (ADAMATCH and FLEXMATCH) are selected as representatives for ST. Experimental results further verify that TAPT has a consistent advantage over ADAMATCH and FLEXMATCH across different labelled sizes on AMAZON REVIEW and YAHOO! ANSWER. It is also worth noting that, while TAPT brings a stable improvement over SUPERVISED across all datasets with varying labelled sizes, ST can sometimes bring more substantial improvement.
#2. TAPT tends to bring more improvements in SSL than in the FULLY-SUPERVISED setting. We further study the behaviour of TAPT itself under SSL, where we select SUPERVISED as the baseline rather than ST approaches. Figure 1 shows that the differences in performance (in absolute values) between TAPT (red lines) and SUPERVISED (green lines) generally increase as the labelled size decreases. To gain a better understanding of the impact of labelled data sizes, we plot the improvement from TAPT over SUPERVISED (in percentages) against the ratio between labelled size and unlabelled size (unlabelled size is fixed for each dataset) in Figure 2. We see that TAPT improves over SUPERVISED further as the ratio of labelled to unlabelled size decreases, highlighting the trend of greater improvement in low-resource SSL settings. This finding is complementary to prior works (e.g. in Howard and Ruder, 2018; Gururangan et al., 2020) that focus on TAPT's improvement from the FULLY-SUPERVISED perspective, represented by the rightmost red vertical line in Figure 2.

Exploring the limits of ST and TAPT
In §4, our experiments showed inconsistent results across datasets. For example, ST performs better on IMDB while TAPT achieves better results on AMAZON REVIEW and YAHOO! ANSWER. We hypothesize that this might be attributed to the exposure to different sizes of labelled or unlabelled data. To verify this hypothesis and shed light on the differences in performance between datasets, we compare TAPT and ST (using ADAMATCH and FLEXMATCH as representatives) by sampling different labelled and unlabelled sizes in IMDB, SST-2, AMAZON REVIEW and YAHOO! ANSWER. Figure 3 visualizes the differences in performance between TAPT and ST, where each cell represents the macro-F1 performance difference of TAPT over ST (averaged across five seeds). In each case, the higher performance of FLEXMATCH and ADAMATCH is selected to represent the performance of ST. Overall, we observe that: (1) TAPT improves the fine-tuning performance even with a few hundred unlabelled examples; and (2) TAPT performs more stably than ST approaches across the different labelled and unlabelled data sizes. Below we provide a comprehensive analysis of the impact of labelled and unlabelled sizes.
#1. TAPT works even with a few hundred unlabelled samples. It is generally assumed that TAPT requires a large amount of unlabelled data to perform well (e.g. Li et al., 2021b; Hou et al., 2022). However, we surprisingly observe that TAPT can bring substantial improvement over the SUPERVISED baseline even with a relatively small number of unlabelled samples, as shown in Figure 5. To explore the effectiveness of TAPT over SUPERVISED in the low-resource setting of unlabelled data, we select the performance of TAPT and SUPERVISED from the first column (the lowest unlabelled size) for each dataset in Figure 3 and plot their average performance over different labelled sizes. Figure 4 shows that TAPT improves over the SUPERVISED baseline with just one hundred or one thousand samples. For instance, TAPT achieves a 5.5% increase in F1 score compared to the SUPERVISED baseline when using only 1k unlabelled samples on YAHOO! ANSWER.
[...] do not provide adequate information. This might be attributed to confirmation bias (Tarvainen and Valpola, 2017; Arazo et al., 2020), which results from the accumulation of errors in the iterative ST process caused by incorrect pseudo-labels. The specific value of the adequate labelled size boundary for ST approaches depends on the nature of the dataset. For example, even though both IMDB and SST-2 are binary classification tasks for movie review sentiment analysis, the labelled size boundary for SST-2 is higher (40 > 4), indicating that this boundary tends to increase as the task becomes more challenging. While it may be easy to obtain a few dozen labelled examples in this case, it is important to be aware of this potential issue with ST approaches when the task becomes more intricate or contains noisy weak labels. TAPT could serve as an alternative in situations where collecting adequate labelled data for training is costly. We provide specific values of the performance of ST and TAPT, and further verify that this finding applies to other ST approaches in Appendix §D.
#3. Adequate labelled data and scarce unlabelled data. In this setting, TAPT is more robust, while ST has a greater chance of performing worse than the SUPERVISED baseline. In Figure 5, we plot the performance of ST approaches and TAPT against five different sizes of unlabelled data, grouped by size (using similar colours). We note that ST approaches perform worse than their corresponding SUPERVISED baselines (represented by horizontal lines) until a certain amount of unlabelled data has been reached. For example, when the labelled size is 500, ST requires about 20k unlabelled samples to achieve the corresponding SUPERVISED baseline performance on YAHOO! ANSWER. On the other hand, TAPT generally outperforms the SUPERVISED baselines, demonstrating its robustness across various unlabelled sizes.
To further quantify the model performance in the case of scarce unlabelled and adequate labelled data, we choose the three lowest unlabelled sizes (the first three columns) excluding the lowest labelled size (the last row) in Figure 3 for each dataset. Our analysis shows that ST has a 67%, 56% and 54% probability of falling below the SUPERVISED baselines on SST-2, AMAZON REVIEW, and YAHOO! ANSWER respectively. Even on IMDB, where ST generally performs well, it still has a 33% probability of falling behind SUPERVISED. In contrast, TAPT never performs worse than SUPERVISED in those cases. We provide computation details and comparative statistics in Appendix §C.
The specific value of the adequate unlabelled size boundary for ST approaches depends on the nature of the dataset as well as the labelled size. Figure 5 illustrates that as the size of the labelled data increases, ST approaches require more unlabelled data to surpass the SUPERVISED baselines. For example, on AMAZON REVIEW, ST trained with 100 labelled samples requires about 5k unlabelled samples to perform better than SUPERVISED, while ST trained with 10k labelled samples requires about 100k unlabelled samples. Adjusting the unlabelled size accordingly might be conducive to exploiting the full potential of ST approaches.
#4. Scarce labelled and unlabelled data. When the labelled data is insufficient, increasing the unlabelled size is not helpful or is even detrimental to ST approaches. This finding is well-illustrated in the last row of results on SST-2 shown in Figure 3. In other words, reducing the size of unlabelled data could be beneficial for ST approaches when the labelled size is inadequate. We further zoom in on this phenomenon in Table 3 by selecting 4 fixed labelled and 500 unlabelled samples, and gradually removing unlabelled samples on IMDB. This is in stark contrast to the case where more unlabelled data is beneficial for ST approaches when adequate labelled data is available. Meanwhile, TAPT generally benefits from training on more in-domain unlabelled data, following the scaling laws of LMs (Kaplan et al., 2020; Hoffmann et al., 2022).
Both ST and TAPT have demonstrated the ability to exploit unlabelled data in this setting. Figure 3 shows that ST dominates in IMDB when more than 10 labelled and 100 unlabelled samples are available. On the other hand, TAPT generally performs better than ST on AMAZON REVIEW and YAHOO! ANSWER, indicating that the answer to which approach is better depends on the nature of the dataset and task. As the labelled and unlabelled data sizes increase, the difference between ST and TAPT shrinks (colours fade and lines converge in Figures 3 and 5). As the size of the labelled data approaches that of the unlabelled data, ST reduces to FULLY-SUPERVISED, which is generally outperformed by TAPT (Gururangan et al., 2020).

Domain Adaptation
We next investigate how ST and TAPT compare in the presence of domain shifts between labelled and unlabelled data in two additional settings (refer to Table 5). First, we experiment with the Unsupervised Domain Adaptation (UDA) setting, where domain shifts exist between the labelled data from a source domain and the unlabelled data from the target domain (Ben-David et al., 2010; Saito et al., 2018; Ramponi and Plank, 2020). Then, we experiment with Self-taught Learning (STL) (Raina et al., 2007) in a domain adaptation setting, where the unlabelled data come from the source domain and the labelled data from the target domain. In both settings, we use the (labelled) validation and test sets from the target domain. Validation and test sets are excluded from any pool of labelled or unlabelled train data.
#1. Unsupervised Domain Adaptation (UDA). In this setting, we use two movie sentiment datasets, IMDB and SST-2, as the source and target domain (and vice versa) with two different sizes of labelled data (i.e. 100 and 200).
Figure 6 depicts the performance of ST and TAPT in UDA. In the case of domain shifts, we observe that FLEXMATCH and ADAMATCH fail to deliver satisfactory results and their performance drops to the level of random guessing, with an F1 score of 33% across all labelled sizes and datasets. This highlights the vulnerability of ST approaches in UDA. In contrast, TAPT demonstrates robust performance even with domain shifts, on par with its own SSL performance without domain shifts. Additionally, TAPT even benefits from training on the source domain. For instance, training on IMDB (source domain) further improves the performance of TAPT on SST-2 (target domain) from 86.4% to 89.6% with 100 labelled samples and from 88.6% to 89.7% with 200 labelled samples.
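As a worked check on the 33% figure, the sketch below computes macro-F1 from scratch and evaluates a degenerate classifier that predicts a single class for every example of a balanced binary test set. Such a classifier obtains an F1 of 2/3 on the predicted class and 0 on the other, hence a macro-F1 of 1/3. The function `macro_f1` is an illustrative implementation written for this example, not the evaluation code used in our experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    y_true = [0] * 50 + [1] * 50   # balanced binary test set
    y_pred = [1] * 100             # every example assigned to one class
    print(round(macro_f1(y_true, y_pred), 4))  # 0.3333
```

A macro-F1 near 33% on a balanced binary task is therefore what one would expect when a model collapses to always predicting the same label.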
#2. Self-taught Learning (STL). We select IMDB, SST-2, and AMAZON REVIEW for this setting. Although they are all sentiment review datasets, IMDB and AMAZON REVIEW are more closely related (see the similarity analysis in Table 7 of Appendix) and arguably contain richer language than SST-2 (see examples in Table 6 of Appendix).
Table 4 presents the performance of ST and TAPT in the STL setting. We find that domain shifts in unlabelled data consistently hurt the performance of ST, depending on the similarity between the source and target domains. The performance of ST drops sharply if the source and target domains are vastly different. For example, when SST-2 is used as the labelled data (target domain) and IMDB or AMAZON REVIEW is used as unlabelled data (source domain), the performance of ST falls from over 80% to around 60% or lower. On the other hand, when using SST-2 and IMDB as the source and target domains, the performance of ST drops by a much smaller margin (a few percentage points). This shows the importance of training ST approaches using more informative labelled data, which is also consistent with our findings in §5.
TAPT in the STL setting is in fact a variation of domain adaptive pre-training (Beltagy et al., 2019; Gururangan et al., 2020) applied to SSL tasks. Table 4 shows that the performance of TAPT remains stable when there exist domain shifts in the unlabelled data. Using more informative unlabelled data can further improve the performance of TAPT. For example, using IMDB or AMAZON REVIEW as unlabelled data when SST-2 is a target task, we see an improvement of about 4% with 100 labelled samples. However, it is worth noting that ST methods can still be competitive compared to TAPT if the source and target domains are relatively similar. For instance, when using AMAZON REVIEW and IMDB as the source and target domains, ST still achieves better results than TAPT.

Related Work
Leveraging unlabelled data by Continuing Pre-training. Previous work has shown that further pre-training LMs on the unlabelled data of a task (e.g. Alsentzer et al., 2019; Mehri et al., 2020; Margatina et al., 2022) or in-domain data (e.g. Logeswaran et al., 2019; Gururangan et al., 2020; Xue et al., 2021) is beneficial to downstream tasks. However, it is unknown whether this is valid in SSL settings. Previous studies in computer vision (Zoph et al., 2020) and speech recognition (Xu et al., 2021a) have compared PT and ST. However, our study has a different focus: we compare TAPT and ST in NLP tasks. Concurrently to our work, Shi and Lipani (2023) put forward prompt-based continued pre-training, which primarily aims to enhance the performance of prompt-based fine-tuning techniques (Schick and Schütze, 2021; Gao et al., 2021). This approach outperforms state-of-the-art ST approaches (Sohn et al., 2020; Xu et al., 2021b; Zhang et al., 2021; Berthelot et al., 2022) as well as the conventional CLS-based fine-tuning with TAPT.
Semi-supervised Learning. Recent work in SSL has demonstrated great progress in effectively exploiting unlabelled data. A wide range of approaches has been proposed, including Pseudo Labeling (Lee et al., 2013), Temporal Ensemble (Laine and Aila, 2017), Mean Teacher (Tarvainen and Valpola, 2017), Virtual Adversarial Training (Miyato et al., 2018), and FixMatch (Sohn et al., 2020). A major issue for ST approaches is confirmation bias, where the student model accumulates errors from the teacher model when learning with inaccurate pseudo-labels (e.g. Wang et al., 2021; Goel et al., 2022; Chen et al., 2022).
While many efforts towards ST (e.g. Ruder and Plank, 2018; Gururangan et al., 2019; Li et al., 2019; Chen et al., 2020b; Meng et al., 2020; Chen et al., 2020a; He et al., 2020; Gera et al., 2022) have been made in NLP, the performance of ST approaches across various labelled and unlabelled sizes has yet to be thoroughly explored. Although Mukherjee and Awadallah (2020) and Li et al. (2021b) noted that training ST approaches from TAPT checkpoints can improve the performance, the performance of TAPT in SSL tasks has neither been well researched by previous works nor compared with state-of-the-art ST approaches.

Conclusion
In this work, we shed light on how TAPT performs against state-of-the-art ST approaches in various SSL settings. Our experiments reveal that TAPT achieves strong and robust performance, even with just a few hundred unlabelled examples. We further demonstrate that the ST approaches are vulnerable to small amounts of either labelled or unlabelled data. We also find that TAPT is more robust than ST approaches in joint domain adaptation and SSL settings. Overall, our empirical study demonstrates that TAPT is a strong SSL learner, competitive with more sophisticated ST approaches. In future work, we plan to further explore the potential of TAPT with unsupervised learning signals.

Limitations
For easier comparison with previous work, we only focus on text classification tasks, while ST can also be applied to a variety of NLP tasks, such as language generation, conversational systems and commonsense reasoning (Kedzie and McKeown, 2019; He et al., 2020; Shi et al., 2022a,b; Hendriksen et al., 2022). We also assume that the datasets are roughly balanced. However, real-world datasets are usually class-imbalanced (Li et al., 2011), which might impact the performance of TAPT and ST. While this is out of the scope of this paper, we believe that this is an interesting avenue for future work. Additionally, different labelled and unlabelled sizes may impact the performance of ST approaches in the domain shift setting. However, this does not alter our conclusion that the effectiveness of ST approaches fluctuates significantly across different scenarios.
IMDB. The IMDB dataset (Maas et al., 2011) contains a collection of 50,000 reviews from the Internet Movie Database, with no more than 30 reviews per movie. This dataset contains an equal number of positive and negative reviews, yielding a 33% Macro-F1 score for random guessing. There are 25,000 samples for training and 25,000 for testing. We follow Wang et al. (2022) to split the dataset by selecting 12,500 samples and 1,000 samples per class from the train set to form a train and validation set, respectively.
SST-2. The SST-2 dataset (Wang et al., 2018) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. Similar to IMDB, this is also a binary classification task. There are 67,349 samples for training and 872 for testing. We select 60,000 and 7,349 samples from the train set to form a train and validation set, respectively, where the validation set contains 3,675 and 3,674 samples for the two classes, respectively.
AG NEWS. The AG NEWS topic classification dataset was constructed by Zhang et al. (2015), where 4 classes are used. Each class contains 30,000 training samples and 1,900 test samples. We follow Wang et al. (2022) to split the dataset by selecting 25,000 samples and 2,500 samples per class from the train set to form a train and validation set, respectively.
AMAZON REVIEW. The AMAZON REVIEW dataset (McAuley and Leskovec, 2013) is a sentiment classification dataset with five classes. There are 600,000 train samples and 130,000 test samples per class. We follow Wang et al. (2022) to split the dataset by selecting 50,000 samples and 5,000 samples per class from the train set to form a train and validation set, respectively.
YAHOO! ANSWER. The YAHOO! ANSWER dataset (Chang et al., 2008) is a topic classification dataset with ten classes. There are 140,000 train samples and 6,000 test samples per class. We follow Wang et al. (2022) to split the dataset by selecting 50,000 samples and 5,000 samples per class from the train set to form a train and validation set, respectively.

A.2 Dataset Similarity
We provide an analysis of the vocabulary overlap of the datasets, as shown in Figure 7. Additionally, in Table 7, we provide some examples to illustrate the overlap between IMDB and AMAZON REVIEW.
As shown in Table 6, although both the SST-2 and IMDB datasets are sentiment analysis tasks for movie reviews, the SST-2 dataset contains shorter and vaguer sentences than the IMDB dataset. This difference could be a potential reason for the poor performance of ST approaches in the UDA setting (§6). In contrast, the AMAZON REVIEW dataset, which is a product review sentiment analysis dataset, is more similar to the IMDB dataset than the SST-2 dataset is, as shown in Table 7. This suggests a potential reason for the performance of ST and TAPT in the STL setting (§6).
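One simple way to quantify vocabulary overlap between two datasets is sketched below. It is an illustration only: the lower-cased whitespace tokenization, the top-k cut-off of 10,000 words and the normalisation by the smaller vocabulary are all assumptions of this sketch, and the names `top_k_vocab` and `vocab_overlap` are hypothetical. The exact procedure behind Figure 7 may differ.

```python
from collections import Counter


def top_k_vocab(texts, k):
    """Return the set of the k most frequent lower-cased whitespace tokens."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word for word, _ in counts.most_common(k)}


def vocab_overlap(texts_a, texts_b, k=10000):
    """Share of the smaller top-k vocabulary that also occurs in the other one."""
    vocab_a = top_k_vocab(texts_a, k)
    vocab_b = top_k_vocab(texts_b, k)
    smaller = min(len(vocab_a), len(vocab_b))
    if smaller == 0:
        return 0.0
    return len(vocab_a & vocab_b) / smaller


if __name__ == "__main__":
    movie_reviews = ["good movie great plot"]
    product_reviews = ["good product great price"]
    print(vocab_overlap(movie_reviews, product_reviews))  # 0.5
```

A higher value indicates that the two corpora share more of their frequent vocabulary, which is one coarse proxy for domain similarity.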

B ST Frameworks
VAT. VAT (Miyato et al., 2018) is a regularization technique that forces pairs of data points that are very close in the input space to be close to each other in the output space. VAT adds a small perturbation to the input data and forces the model to produce similar predictions.
FIXMATCH. FIXMATCH (Sohn et al., 2020) generates artificial labels using both consistency regularization and pseudo-labelling, where the artificial labels are produced based on weakly-augmented unlabelled data. These artificial labels are then used as targets to train the model on strongly-augmented unlabelled data. FIXMATCH only retains an artificial label if the model assigns a high probability to one of the possible classes.
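The retention rule in the last sentence can be sketched as a simple confidence filter. This is an illustrative sketch: the function name `select_pseudo_labels` is hypothetical, the inputs are assumed to be class-probability vectors predicted on weakly-augmented unlabelled examples, and the default threshold of 0.95 is a commonly used value rather than necessarily the one used in our runs (see Table 14 for the hyperparameters).

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Keep (example_index, predicted_class) pairs with confident predictions.

    probabilities: list of per-class probability lists, one per unlabelled
    example, as predicted on the weakly-augmented view.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            # The arg-max class becomes the hard pseudo-label (training target
            # for the strongly-augmented view of the same example).
            selected.append((idx, probs.index(confidence)))
    return selected


if __name__ == "__main__":
    weak_view_probs = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
    print(select_pseudo_labels(weak_view_probs))  # [(0, 0), (2, 1)]
```

Examples whose highest class probability falls below the threshold are simply ignored in the unlabelled loss for that step.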
DASH. DASH (Xu et al., 2021b) extends FIXMATCH by introducing a mechanism with a dynamically adjusted loss threshold to select a subset of training examples from the unlabelled data for performing SSL.
FLEXMATCH. FLEXMATCH (Zhang et al., 2021) also extends FIXMATCH by introducing the concept of curriculum learning (Bengio et al., 2009) to flexibly adjust thresholds for different classes at each time step and select unlabelled data and their pseudo-labels that are more likely to be informative.
ADAMATCH. ADAMATCH (Berthelot et al., 2022) aims to solve domain adaptation problems in SSL and build a high-accuracy model that is trained and tested on different data distributions. ADAMATCH builds on FIXMATCH and introduces a relative confidence threshold and a modified distribution alignment from Berthelot et al. (2019a).
C Probability of performing worse than SUPERVISED
In §5, we discuss that we select the model performance with the three lowest unlabelled sizes (the first three columns in Figure 3) for each dataset and exclude the model performance with the lowest labelled size (the last row in Figure 3). This results in 9 cells in IMDB, 3 cells in SST-2, 9 cells in AMAZON REVIEW, and 12 cells in YAHOO! ANSWER, where TAPT has one run per cell and ST (FLEXMATCH and ADAMATCH) has two runs per cell. We consider a run to be a failure if its performance is worse than its corresponding SUPERVISED baseline. Table 8 lists the probability of ST and TAPT falling below the SUPERVISED baseline with selected combinations of labelled and unlabelled sizes.
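The computation described above amounts to counting failed runs. The sketch below shows the arithmetic; the function name `failure_probability` is hypothetical. The cell counts and the two ST runs per cell come from the text, whereas the failure counts in `INFERRED_ST_FAILURES` are not reported values: they are the only integer counts that round to the percentages quoted in §5 given the number of runs, and are included purely as a consistency check.

```python
def failure_probability(run_scores, baseline_scores):
    """Fraction of runs scoring strictly below their SUPERVISED baseline."""
    if len(run_scores) != len(baseline_scores) or not run_scores:
        raise ValueError("need one baseline per run and at least one run")
    failures = sum(1 for s, b in zip(run_scores, baseline_scores) if s < b)
    return failures / len(run_scores)


# Cells selected per dataset (from the text above).
CELLS = {"IMDB": 9, "SST-2": 3, "AMAZON REVIEW": 9, "YAHOO! ANSWER": 12}
ST_RUNS_PER_CELL = 2  # FLEXMATCH and ADAMATCH

# Inferred, not reported: counts consistent with 33%, 67%, 56% and 54%.
INFERRED_ST_FAILURES = {"IMDB": 6, "SST-2": 4, "AMAZON REVIEW": 10, "YAHOO! ANSWER": 13}

if __name__ == "__main__":
    for name, cells in CELLS.items():
        total_runs = cells * ST_RUNS_PER_CELL
        share = 100 * INFERRED_ST_FAILURES[name] / total_runs
        print(f"{name}: {INFERRED_ST_FAILURES[name]}/{total_runs} = {share:.0f}%")
```

For example, the 12 selected cells on YAHOO! ANSWER give 24 ST runs, and 13 failures out of 24 corresponds to roughly 54%.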

D Further validations with other ST approaches
In this section, we conduct additional experiments on ST approaches, including VAT, DASH, and FIXMATCH, to demonstrate that our findings are applicable to other ST approaches as well.
In Table 9, we select several combinations of labelled and unlabelled sizes on the IMDB, SST-2, AMAZON REVIEW, and YAHOO! ANSWER datasets. Our experimental results show that other ST approaches do not perform well when the labelled size is low, and that they have a high probability of performing worse than the SUPERVISED baselines when the unlabelled size is low. This suggests that poor performance when the labelled or unlabelled size is inadequate may be a common problem of state-of-the-art ST approaches.

E Train ST approaches with TAPT checkpoints
Previous works (Mukherjee and Awadallah, 2020; Li et al., 2021b) have suggested that training ST approaches from a TAPT checkpoint may be beneficial. Here we also provide some additional experiments to train ST approaches with TAPT checkpoints to further corroborate our findings.
Table 10 shows that TAPT outperforms ADAMATCH +TAPT or FLEXMATCH +TAPT with two different labelled sizes on the YAHOO!AN-SWER dataset.
Table 11 shows that training ST approaches from TAPT checkpoints could improve the performance of ST but cannot solve the issue of ST approaches when labelled or unlabelled data is not adequate.Specifically, the performance of ST +TAPT is still poor when labelled data is insufficient, as discussed in §5.Meanwhile, in Table 11, the performance of ST +TAPT could be outperformed by the SU-PERVISED baselines when unlabelled data is inadequate, while TAPT consistently outperforms the SUPERVISED baselines.When the labelled size is 10, the performance of ST trained with fewer unlabelled samples tends to be better, indicating that reducing the number of unlabelled data can be helpful, as discussed in §5.

F Implementation Details
We consistently use five random seeds, ranging from 1 to 5, for all algorithms. The sampled labelled data is the same for all algorithms for a given seed. The development and test sets remain unchanged across all labelled and unlabelled data sizes.
Our model implementation uses open-source libraries including HuggingFace Transformers, Fairseq, and USB. Our TAPT experiments are performed on 8x32GB V100 GPUs, with a batch size of 16 per device and 2 gradient accumulation steps.
Table 12 lists the hyperparameters used for the TAPT phase. Table 13 lists the hyperparameters used for the fine-tuning phase. Table 14 lists the hyperparameters used for ST approaches.
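The seeding and batch-size setup described in this section can be sketched as follows. This is a minimal illustration rather than the released code: `sample_labelled` and `effective_batch_size` are hypothetical helpers, uniform sampling without replacement is an assumption, and the effective batch size assumes data-parallel training across all 8 GPUs.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

SEEDS = [1, 2, 3, 4, 5]  # the five seeds shared by all algorithms


def sample_labelled(train_set: Sequence[T], n_labelled: int, seed: int) -> List[T]:
    """Draw the labelled subset for one seed (hypothetical helper).

    A local RNG makes the draw depend only on the seed, so every
    algorithm sees the same labelled examples for that seed.
    """
    rng = random.Random(seed)
    return rng.sample(list(train_set), n_labelled)


def effective_batch_size(n_devices: int, per_device: int, grad_accum: int) -> int:
    """Examples contributing to one optimizer step under data parallelism."""
    return n_devices * per_device * grad_accum


# 8 GPUs x 16 examples per device x 2 accumulation steps.
TAPT_EFFECTIVE_BATCH = effective_batch_size(n_devices=8, per_device=16, grad_accum=2)
```

Under these assumptions, each TAPT optimizer step uses 256 examples.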

Dataset Example
IMDB I watched this movie after seeing other comments on IMDb, even convincing my wife that it was a "unique horror movie." I wanted to like this movie, but was unable to. The "love story" was good, but the horror aspect was quite bad. If the story was just about a young man who fell in love with a girl suffering from parasomnia, then it would have been a better movie. The care centre stretched credulity well past the limits, in fact it was quite ridiculous. The doctor happily ignors privacy laws and professionalism. A nurse goes into a room for a routine feeding of a dangerous patient (without security escort), and drops the tray and runs out of the room screaming for no apparent reason. The forensic patient (and the film's villain) is tied up in a standing position fully clothed - apparently for years? None of it makes much sense. The movie even had some actors that I've liked in other things, such as the detectives, but still I can't recommend this movie.
SST-2 a rewarding work of art for only the most patient and challenge-hungry moviegoers.

Table 12: Hyperparameters for task-adaptive pre-training. The learning rate and unlabelled size are tightly connected and need to be adjusted together. We generally recommend increasing the learning rate as the unlabelled size increases.

Unlike its predecessor BERT (Devlin et al., 2019), which also uses the next sentence prediction objective, ROBERTA (Liu et al., 2019) is trained only with the MLM objective (i.e., cross-entropy loss on predicting randomly masked tokens), dynamically changing the masking pattern applied to the training examples and typically using a masking probability of 0.15.
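Dynamic masking with a masking probability of 0.15 can be sketched as below. This is a simplified illustration, not the ROBERTA implementation: every selected token is replaced by the mask symbol, whereas practical implementations typically also leave some selected tokens unchanged or swap in random tokens. The names `dynamic_mask` and `MASK_TOKEN` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "<mask>"  # placeholder mask symbol for this sketch


def dynamic_mask(
    tokens: List[str], rng: random.Random, mask_prob: float = 0.15
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels).

    labels holds the original token at masked positions and None elsewhere,
    so the cross-entropy loss is computed only on masked positions.
    """
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Calling the function again for each epoch draws a fresh masking pattern,
# which is what makes the masking dynamic rather than fixed at preprocessing.
rng = random.Random(1)
sentence = "a rewarding work of art for patient moviegoers".split()
epoch_views = [dynamic_mask(sentence, rng) for _ in range(3)]
```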

Figure 3: Performance difference between TAPT and ST with varying labelled and unlabelled sizes on IMDB, SST-2, AMAZON REVIEW and YAHOO! ANSWER. Positive values indicate that TAPT performs better, while negative values indicate that ST performs better. Average Macro-F1 score on test sets over five seeds is reported.

Figure 4: The performance of TAPT against the SUPERVISED baseline in the low-resource setting of unlabelled data. From left to right, TAPT uses 100, 100, 1,000, and 1,000 unlabelled samples, respectively.

Figure 6: Results of UDA experiments. Legends indicate domains of labelled training data. Orange/green represents the performance with/without domain shift. Average Macro-F1 score on test sets over five seeds is reported.
approaches is to utilise a teacher model trained on labelled examples to make predictions for unlabelled examples, and train a new student model with these predictions. Formally, let $L \triangleq \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denote $n$ labelled examples and $U \triangleq \{\tilde{x}_1, \ldots, \tilde{x}_m\}$ denote $m$ unlabelled examples, where usually $m \gg n$.
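The teacher-student procedure above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not any of the ST approaches we evaluate: a 1-D nearest-centroid classifier stands in for the language model, the student is trained on the labelled set plus all pseudo-labelled examples, and no confidence filtering is applied. All names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[float, int]          # (feature, label)
Classifier = Callable[[float], int]


def fit_centroids(data: List[Example]) -> Classifier:
    """Toy stand-in for model training: a 1-D nearest-centroid classifier."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))


def self_train(labelled: List[Example], unlabelled: List[float]) -> Classifier:
    teacher = fit_centroids(labelled)                # 1. train the teacher on L
    pseudo = [(x, teacher(x)) for x in unlabelled]   # 2. pseudo-label U with the teacher
    return fit_centroids(labelled + pseudo)          # 3. train the student on L and pseudo-labelled U


# Hypothetical toy data with n = 2 labelled and m = 4 unlabelled examples.
labelled = [(0.0, 0), (1.0, 1)]
unlabelled = [0.1, 0.2, 0.9, 1.1]
student = self_train(labelled, unlabelled)
```

Any error the teacher makes in step 2 is passed to the student as if it were a true label, which is one way pseudo-labels can hurt when the labelled set is small.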
Table 1 shows data

Table 1: Statistics of datasets. |Y|: number of classes for classification tasks. L: average number of words in input sentence(s). Note that we only sample examples from the original training set in our experiments.

Table 2: Performance of TAPT, ST approaches and the baselines across five datasets using two different sizes of the training labelled data. We report average Macro-F1 on the test set across five seeds, with standard deviations in subscripts. Blue and orange represent the best and second-best performance in a column respectively.

Table 4: Results of STL experiments. We report the average Macro-F1 score on the test set across five seeds, with standard deviations as subscripts. Blue represents the best result for each row. Stars highlight rows without domain shifts. Coloured arrows indicate the change in performance relative to the starred row within each cell.

Table 5: A summary of domain adaptation, where the distributions of the source and target domains differ.
Teen flies in plane #39;s landing gear A homeless teenager who hid in the landing gear of a passenger plane survived a 700-kilometre flight across south-western China but his companion fell and probably died, state media reported on Friday.
AMAZON REVIEW THIS is MUSIC at its BEST Rob Dougan has done it. He's crafted musical perfection, or close to it anyway. I have finally found the music I've been waiting for my whole life in this album - Rob D you are a genius. I think a lot of us wanted to know more about this guy as soon as we heard the track playing to the "Woman in the Red Dress" scene. Now I know why the Wachowski brothers have enlisted his musical talents to flesh out their movies. I know I should be trying to write a more helpful, objective review but I can do nothing but wax poetic for Rob Dougan and his debut album. He has mixed classical melodies with awesome electric beats and it all comes together in an audio orgy. Just buy the album already and let's get Rob some more mainstream recognition.
YAHOO! ANSWER Does anybody know a great deal about angels? I'm looking for names, if they're good or bad, what they look like, etc. The more detail the better. All religions accepted

Table 7: Similarity analysis between IMDB and AMAZON REVIEW with four examples that highlight the overlap.
Fabulous actors, beautiful scenery, stark reality [...] I tried to buy the video for several years, finally bought it used from a video store that went out of business. But Yippee! The DVD is now for sale, I purchased it on amazon.com. Not cheap, but well worth it to me. [...] Well worth the import price. My first impression of this album was a good one, but as time went on it came to grow on me more and more. This is certainly one of the better Costes albums. The mixing is nothing revolutionary, but it is well done and all tracks flow into each other very well. [...].
IMDB AMAZON REVIEW
I loved this movie since I was 7 and I saw it on the opening day. It was so touching and beautiful. I strongly recommend seeing for all. It's a movie to watch with your family by far. My MPAA rating: PG-13 for thematic elements, prolonged scenes of disastor, nudity/sexuality and some language. This is a very touching, spiritual movie! When I first saw this film, [...]. I was deeply moved by this motion picture, and the DVD brings the story to your own home. The bonus materials could be better, but the main part of the DVD is the actual movie. Great, great, great film... [...] Pacino is over-the-top but to good effect as he's clearly having loads of fun. Beatty is great [...] The lighting, velvet overtones and smog/smoke combine to create a great effect. There are some really funny cameos [...] Highly recommended. 4.5/5 stars. [...] Makes a great gift! We bought this book for my dad for Father's Day this year, and thought he would have fun reading it since he has four granddaughters. He loved it and has even selected stories to read to the girls during over-nights with Grandpa and Grandma. I highly recommend it as a great gift. The late [...] scripted this tale of terror and it was absolutely one of the scariest movies I ever saw as a kid. (I had to walk MILES just to see a movie, and it was usually dark when I emerged from the theater; seeing a horror movie was always unnerving [...] Movia ... please .... This movie is a masterpiece of terror & suspence & Beautifully filmed & acted. Comparisons to reality are not allowed when reviewing films of this caliber. Your reaction (though it MAY be sarcastic) is EXACT proof of it's genius! Watch it again...and this time....bask in all it's glory!

Table 8: Results on the effect of low unlabelled sizes on ST and TAPT. Failure means performing worse than SUPERVISED.

Table 9: We further verify our conclusion on VAT, DASH, and FIXMATCH. We report the average Macro-F1 score on the test set across five seeds, with standard deviations as subscripts. Blue represents the best results for each row.

Table 10: Results of ADAMATCH +TAPT and FLEXMATCH +TAPT on YAHOO! ANSWER with two different labelled sizes.

Table 11: We further verify our conclusion on FLEXMATCH +TAPT. We report the average Macro-F1 score on the test set across five seeds, with standard deviations as subscripts. Blue represents the best results for each row.

Table 13: Hyperparameters for fine-tuning. More epochs are used when the labelled size is low.

Table 14: Hyperparameters for self-training. Algorithm-specific hyperparameters will be released in configuration files with the code.