Task-adaptive Pre-training and Self-training are Complementary for Natural Language Understanding

Task-adaptive pre-training (TAPT) and Self-training (ST) have emerged as the major semi-supervised approaches to improve natural language understanding (NLU) tasks with massive amounts of unlabeled data. However, it is unclear whether they learn similar representations or whether they can be effectively combined. In this paper, we show that TAPT and ST can be complementary with a simple protocol that follows the TAPT → Finetuning → Self-training (TFS) process. Experimental results show that the TFS protocol can effectively utilize unlabeled data to achieve strong combined gains consistently across six datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition and dialogue slot classification. We investigate various semi-supervised settings and consistently show that the gains from TAPT and ST can be strongly additive when following the TFS procedure. We hope that TFS can serve as an important semi-supervised baseline for future NLP studies.


Introduction
Deep neural networks (Goodfellow et al., 2016) often require large amounts of labeled data to achieve state-of-the-art performance (Xie et al., 2020). However, acquiring high-quality annotations is both time-consuming and expensive, which has inspired research on methods that can exploit unlabeled data to improve performance (He et al., 2020). Pre-trained language models like BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and T5 (Raffel et al., 2020) can learn general language understanding abilities from large-scale unlabeled corpora and have reduced this annotation cost. In this paradigm, large neural networks are first pre-trained on massive amounts of unlabeled data in a self-supervised manner and then finetuned on large amounts of labeled data for specific downstream tasks, which has led to large improvements in natural language understanding on standard benchmarks (Wang et al., 2019b,a). However, their success still relies on large amounts of data during the finetuning stage. For example, Wu et al. (2020) show that BERT achieves only 6.4% joint goal accuracy with 1% of the finetuning data for dialogue state tracking, a core component of task-oriented dialogue systems, far behind the 45.6% achieved with the full training set. This data-intensive finetuning poses challenges for many real-world applications, where collecting large amounts of labeled data is not only expensive and time-consuming but sometimes also infeasible due to data access and privacy constraints (Wang et al., 2021).
Semi-supervised learning (Thomas, 2009) provides a plausible solution to the aforementioned data-hungry issue by making effective use of freely available unlabeled data. One of the most popular semi-supervised learning algorithms is self-training (Scudder, 1965). In self-training, a teacher model is first trained on the available labeled data and then used to generate pseudo labels for unlabeled data. The original hand-annotated data and the pseudo-labeled data are combined to train a student model. The student model is assigned as the teacher model for the next round, and the teacher-student training procedure is repeated until convergence or a maximum number of rounds is reached. Self-training utilizes unlabeled data in a task-specific way during the pseudo-labeling process (Chen et al., 2020b) and has been successfully applied to a variety of tasks, including image recognition (Xie et al., 2020; Zoph et al., 2020), automatic speech recognition (Kahn et al., 2020), text classification (Du et al., 2021; Mukherjee and Awadallah, 2020), sequence labeling (Wang et al., 2021) and neural machine translation (He et al., 2020).
Recently, task-adaptive pre-training (TAPT) (Gururangan et al., 2020) was proposed to adapt pre-trained language models, e.g. BERT and RoBERTa, to the unlabeled in-domain training set and improve performance. The intuition of TAPT is that datasets for specific tasks may only cover a subset of the text within the broader domain, and continuing pre-training on the task dataset itself or on other relevant data can be useful (Gururangan et al., 2020). TAPT adapts its linguistic representations by utilizing the unlabeled data in a task-agnostic way (Chen et al., 2020b). With the recent successes of task-adaptive pre-training and self-training in natural language understanding (NLU), a research question arises: Are task-adaptive pre-training (TAPT) and self-training (ST) complementary for natural language understanding (NLU)?
In this paper, we show that TAPT and ST can be complementary with a simple protocol that follows the TAPT → Finetuning → Self-training (TFS) process. The TFS protocol consists of three steps: (1) TAPT on an unlabeled corpus drawn from the task; (2) standard supervised finetuning on labeled data, inheriting parameters from TAPT as initialization, to train a teacher model; (3) the teacher model generates pseudo labels for the same unlabeled corpus as in (1) and trains a student model in a self-training framework until convergence or a maximum number of rounds is reached, as shown in Figure 1. The first step utilizes the unlabeled corpus in a task-agnostic way to learn general linguistic representations, while the third step utilizes it in a task-specific way during the pseudo-labeling process. Therefore, the unlabeled data are utilized twice in two different ways, taking advantage of both TAPT and ST. TFS can effectively utilize unlabeled data to achieve the strong combined gains of TAPT and ST consistently across six datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition and dialogue slot classification. We further investigate various semi-supervised settings and consistently show that the gains from TAPT and ST can be strongly additive when following the TFS procedure.

Related Work
Pre-training. Unsupervised or self-supervised pre-training has achieved remarkable successes in natural language processing (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, these models are pre-trained on a very large general-domain corpus, e.g. Wikipedia, which may limit their performance on a specific task due to distribution shift (Wu et al., 2020; Gururangan et al., 2020). To better handle this issue, domain-adaptive pre-training (DAPT), which continues pre-training of existing language models, e.g. BERT and RoBERTa, on a large corpus of unlabeled domain-specific text, has been proposed and achieved great successes in specific domains (Gururangan et al., 2020; Wu et al., 2020). Lee et al. (2020) proposed BioBERT by continuing pre-training of BERT on a biomedical-domain corpus and significantly outperformed BERT on biomedical text mining. Following a similar idea, Wu et al. (2020) proposed ToD-BERT by continuing pre-training of BERT on nine dialogue datasets for NLU tasks in task-oriented dialogue systems and achieved great successes in various few-shot NLU tasks in the dialogue domain. Gururangan et al. (2020) took this one step further and continued pre-training of language models on a much smaller amount of unlabeled data drawn from the same distribution as a given task (TAPT), which not only achieves competitive results with DAPT but is also complementary to it.
Self-training. Self-training, one of the earliest and simplest semi-supervised learning methods, has recently shown state-of-the-art performance for tasks like image classification (Xie et al., 2020; Sun et al., 2019) and object detection (Zoph et al., 2020), and can perform on par with fully supervised models while using much less labeled training data. In natural language processing, Mukherjee and Awadallah (2020) applied self-training to few-shot text classification and incorporated uncertainty estimates of the underlying neural network for unlabeled data selection. Wang et al. (2021) improved self-training with meta-learning through adaptive sample reweighting to mitigate error propagation from noisy pseudo labels for named entity recognition and slot tagging in task-oriented dialog systems. He et al. (2020) injected noise into the input space as a noisy version of self-training for neural sequence generation and obtained state-of-the-art performance for tasks like neural machine translation. Du et al. (2021) utilized information retrieval to obtain task-specific in-domain data from a large amount of web sentences for self-training. Beyond these applications, Wei et al. (2021) theoretically proved that self-training with input-consistency regularization achieves high accuracy with respect to ground-truth labels under certain assumptions.
There also exist works combining pre-training with self-training. Chen et al. (2020b) first conducted self-supervised pre-training with SimCLR (Chen et al., 2020a) on ImageNet (Russakovsky et al., 2015) in a task-agnostic way, then finetuned the pre-trained models on limited labeled data, and finally performed self-training/knowledge distillation (Hinton et al., 2015) on the same unlabeled examples as pre-training in a task-specific way. Such a framework enables models to make use of the data twice, in both the pre-training and the self-training/knowledge distillation stages. Xu et al. (2020) followed this framework for speech recognition and achieved state-of-the-art performance with very limited labeled data. However, it is unclear whether language models like BERT, which have already been pre-trained on a very large general corpus, can benefit from this framework, since Chen et al. (2020b) and Xu et al. (2020) conducted pre-training from scratch. In addition, they only performed self-training for one round, making it unclear whether iterative self-training without pre-training could achieve comparable results in the end. A recent work (Du et al., 2021) did both continued pre-training and self-training on data retrieved from open domains but only observed gains for self-training, while our work utilizes existing in-domain unlabeled data and finds that both TAPT and self-training are effective.

Problem setup
Assuming that we can only access a small amount of labeled data $D_l$ for a given task along with a much larger amount of unlabeled data $D_u$, our goal is to fully leverage the unlabeled data $D_u$ to improve model performance.

Task-adaptive Pre-training (TAPT)
One simple yet effective way to improve BERT-like models with unlabeled data is task-adaptive pre-training (TAPT). The approach is quite straightforward: simply continue pre-training BERT-like models with the masked language modeling (MLM) objective (Devlin et al., 2019) on the unlabeled text data of a specific task (Gururangan et al., 2020).
Specifically, during the MLM process, a proportion of randomly sampled tokens in the input are masked out with the special token [MASK]. We conduct dynamic token masking during pre-training, following Wu et al. (2020). The training objective of MLM is the cross-entropy loss for reconstructing the masked tokens:
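$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in \mathcal{M}} \log p_\theta\left(x_i \mid \tilde{x}\right) \quad (1)$$

where $\tilde{x}$ denotes the masked input sequence, $\mathcal{M}$ the set of masked positions, $x_i$ the original token at position $i$, and $p_\theta$ the model's predicted token distribution (this is the standard MLM formulation of Devlin et al. (2019); the specific notation is ours).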

Self-training (ST)
Self-training begins with a teacher model $p_t$ trained on the labeled data $D_l$. The teacher model is used to generate pseudo labels for the unlabeled data $D_u$. The augmented data $D_l \cup D_u$ is then used to train a student model $p_s$. Specifically, for each $x_j \in D_u$, we use the teacher model to generate its soft label, and the student model is then trained with the standard cross-entropy loss on labeled data and the KL divergence on unlabeled data, which can be formulated as

$$\mathcal{L}_{st} = \frac{1}{|D_l|}\sum_{(x_i, y_i) \in D_l} \mathrm{CE}\big(y_i, p_s(x_i)\big) + \frac{1}{|D_u|}\sum_{x_j \in D_u} \mathrm{KL}\big(p_t(x_j) \,\|\, p_s(x_j)\big), \quad (2)$$

where the teacher model $p_t$ is fixed within the current round. After the student model is trained with objective $\mathcal{L}_{st}$, it is assigned as the new teacher model for the next round, and the teacher-student training procedure is repeated until convergence or a maximum number of rounds is reached.
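To make the objective concrete, a minimal PyTorch sketch of a single-batch version of Equation 2 might look as follows (the function and variable names are illustrative, not from the original implementation):

```python
import torch.nn.functional as F

def self_training_loss(student_logits_l, labels_l, student_logits_u, teacher_logits_u):
    """One-batch version of the self-training objective (Equation 2).

    student_logits_l: student logits on labeled examples, shape (B_l, K)
    labels_l:         gold labels for labeled examples, shape (B_l,)
    student_logits_u: student logits on unlabeled examples, shape (B_u, K)
    teacher_logits_u: frozen teacher logits on the same unlabeled examples, shape (B_u, K)
    """
    # Cross-entropy on labeled data.
    ce = F.cross_entropy(student_logits_l, labels_l)
    # KL(teacher || student) on unlabeled data; the teacher provides soft pseudo labels.
    teacher_probs = F.softmax(teacher_logits_u.detach(), dim=-1)
    student_log_probs = F.log_softmax(student_logits_u, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return ce + kl
```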

TAPT → Finetuning → Self-training (TFS)
Although TAPT has proven effective for utilizing unlabeled data, it is task-agnostic in the sense that it is unaware of the specific task, e.g. sentence classification or named entity recognition. This paradigm learns the general linguistic representations buried in unlabeled data, which are not directly tailored to a specific task. Utilizing data in a task-agnostic way may lose the information in the unlabeled data that is key to the task at hand. On the contrary, self-training utilizes unlabeled data in a task-specific way: pseudo labels are obtained through trained models, so task-specific information can be encoded into the pseudo labels. However, this method may only work well when a considerable portion of the predictions on unlabeled data are correct (He et al., 2020); otherwise, early mistakes made by the teacher model $p_t$ due to limited labeled data can reinforce themselves by generating incorrect labels for the unlabeled data, and re-training on these data can result in an even worse student model $p_s$ in the next round (Zhu and Goldberg, 2009).

Figure 1: Overview of TFS. (1) TAPT on the unlabeled corpus drawn from a task; (2) train a teacher model on labeled data with TAPT as initialization; (3) the teacher generates pseudo labels for the same unlabeled corpus as in (1) and trains a student model with both labeled and pseudo-labeled data in an iterative self-training framework.
The TFS protocol, following the TAPT → Finetuning → Self-training process, can take advantage of both TAPT and ST while avoiding their weaknesses. The overall pipeline of TFS is shown in Figure 1. TFS first utilizes unlabeled data in a task-agnostic way through TAPT to obtain a better initialization for finetuning in the next step, and then finetunes a teacher model on labeled data in a standard supervised way, initializing its parameters from TAPT. These two steps build a better teacher model, avoid early mistakes and generate more accurate predictions for the student, which is key to the success of self-training. The unlabeled data is leveraged again during the self-training process in a task-specific way to further boost the performance of the model at hand. We summarize the workflow of TFS in Algorithm 1.

Algorithm 1: TAPT → Finetuning → Self-training (TFS)
1: Continue pre-training the language model on the unlabeled corpus $D_u$ with the MLM objective (TAPT)
2: Finetune a teacher model $p_\tau$ on the labeled data $D_l$, initialized from the TAPT checkpoint
3: repeat
4: Apply $p_\tau$ to the unlabeled corpus $D_u$ to obtain $\hat{D}_u := \{(x_j, p_\tau(x_j)) \mid \forall x_j \in D_u\}$
5: Train a student model $p_\tau$ on $D_l \cup \hat{D}_u$ by Equation 2
6: Assign $p_\tau$ as the teacher for the next round
7: until convergence or maximum rounds are reached
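As a rough illustration of how the three stages fit together, the following Python sketch outlines the TFS loop; the four callables (tapt_pretrain, finetune, pseudo_label, train_student) are hypothetical placeholders for the stages described above, not the original implementation:

```python
from typing import Callable

def run_tfs(
    tapt_pretrain: Callable,   # continued MLM pre-training on the unlabeled corpus
    finetune: Callable,        # supervised finetuning on the labeled data
    pseudo_label: Callable,    # teacher -> soft labels for the unlabeled corpus
    train_student: Callable,   # training with Equation 2 (CE on labeled + KL on pseudo-labeled)
    base_model,
    labeled_data,
    unlabeled_data,
    max_rounds: int = 3,
):
    """High-level sketch of the TFS protocol (Algorithm 1)."""
    # Step 1: TAPT -- task-agnostic use of the unlabeled corpus.
    tapt_ckpt = tapt_pretrain(base_model, unlabeled_data)
    # Step 2: finetune the teacher, initialized from the TAPT checkpoint.
    teacher = finetune(tapt_ckpt, labeled_data)
    # Step 3: iterative self-training -- task-specific use of the same unlabeled corpus.
    student_init = tapt_ckpt          # first-round students start from the TAPT checkpoint
    for _ in range(max_rounds):
        pseudo_labeled = pseudo_label(teacher, unlabeled_data)
        student = train_student(student_init, labeled_data, pseudo_labeled)
        teacher = student             # the student becomes the next round's teacher
        student_init = teacher        # later rounds inherit parameters from the teacher
    return teacher
```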

Experiments
Here we conduct comprehensive experiments and analysis on different NLU datasets to demonstrate the effectiveness of TFS.

Experimental Setup
We use six popular large-scale datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition and dialogue slot classification as follows.
(1) SST-2 consists of sentences from movie reviews with human annotations of their sentiment (Socher et al., 2013). The task is to predict whether the sentiment of a given sentence is positive or negative (Wang et al., 2019b).
(2) Both QNLI (Wang et al., 2019b) and MNLI (Williams et al., 2018) are natural language inference datasets. QNLI is adapted from the SQuAD (Rajpurkar et al., 2016) question answering dataset, and the task is to predict whether the context sentence contains the answer to a given question (Wang et al., 2019b), which can be regarded as a binary classification problem. MNLI differs slightly from QNLI in that it covers multiple genres. Specifically, each MNLI example consists of a premise sentence and a hypothesis sentence, and the task is to predict whether the premise entails (entailment), contradicts (contradiction), or neither entails nor contradicts (neutral) the hypothesis (Wang et al., 2019b).
(3) QQP is a paraphrase identification dataset (Chen et al., 2018). The goal is to determine if two questions asked on Quora are semantically equivalent (Wang et al., 2019b), which can also be formulated as a binary classification problem.
(4) CoNLL 2003 is a named entity recognition dataset, where the task is to recognize four types of named entities: persons, locations, organizations and miscellaneous entities, the last covering names that do not fall into any of the previous three categories (Tjong Kim Sang and De Meulder, 2003).
(5) MultiWOZ 2.1 is a large-scale multi-domain dialogue dataset with human-human conversations (Eric et al., 2020). We convert each dialogue into turns, and the task is to predict whether a slot, e.g. restaurant name, is mentioned in a turn, which can be cast as a multi-label binary slot classification problem (Li et al., 2021).
SST-2, QNLI, MNLI and QQP are from the GLUE benchmark¹ and we only report their results on the development sets, as our extensive experiments do not allow us to submit predictions on their test sets to the official leaderboard due to submission limitations². Note that for both MNLI and QQP, we randomly downsample the training sets to 100K examples and the development sets to 5K, since otherwise iterative self-training in various semi-supervised setups would be too costly; for MNLI, we report results on the matched development set. On both CoNLL 2003 and MultiWOZ 2.1, we report results on the test sets. For SST-2, MNLI and QNLI, we use the standard accuracy metric, and for QQP and CoNLL 2003 we report F1 scores. For MultiWOZ 2.1, we report micro-F1. We summarize the details of each dataset, including task, full training data size, number of classes and evaluation metric, in Table 1.

¹ We only consider datasets with a training set larger than 10K in the GLUE benchmark.
² See more about FAQ 1 at https://gluebenchmark.com/faq
TAPT. We use BERT-base and BERT-large as our backbones to leverage both labeled and unlabeled data. Both labeled and unlabeled data are used for TAPT in our implementation so that we can use the same checkpoint for different data splits and labeled data sizes without repeating the costly pre-training process on the same dataset. During the TAPT process, we use the MLM objective with a random token masking probability of 0.15 on each training set listed in Table 1, following previous work (Devlin et al., 2019).

Finetuning. We follow the standard supervised finetuning paradigm (Devlin et al., 2019) by adding a linear projection layer with weight $W \in \mathbb{R}^{K \times I}$ on top of BERT for the labeled data of each dataset listed in Table 1, where $K$ is the number of classes and $I$ is the dimensionality of BERT's representations. Specifically, for SST-2, QNLI, MNLI and QQP, we pass the representation of the [CLS] token, $H_{\mathrm{CLS}}$, to a linear layer followed by a Softmax function; models are trained with the cross-entropy loss between the predicted distributions $\mathrm{Softmax}(W H_{\mathrm{CLS}})$ and the ground-truth labels. For the CoNLL 2003 named entity recognition task, we feed the representation of each token into a linear layer followed by a Softmax function; models are trained with the average cross-entropy loss between the predicted distributions and the labels over all tokens³. For the multi-label binary slot classification task on MultiWOZ 2.1, we pass the [CLS] representation $H_{\mathrm{CLS}}$ into a linear layer followed by a Sigmoid function; models are trained with the mean binary cross-entropy loss between the predicted distributions $\mathrm{Sigmoid}(W H_{\mathrm{CLS}})$ and the ground-truth labels.

Self-training. We use the models finetuned on labeled data as teachers to generate pseudo soft labels on unlabeled data, following Du et al. (2021). The pseudo-labeled data are combined with the original labeled data to train student models by optimizing the objective in Equation 2. In the first round, students use the same pre-trained checkpoints as their teachers, and in the following rounds, students inherit parameters from their teachers. We set the maximum number of rounds to 3, since we observe that a much larger number of rounds brings the same results or only very marginal gains on both SST-2 and CoNLL 2003.
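As a concrete reference, the following minimal PyTorch sketch shows the two sentence-level head variants described above: a softmax head trained with cross-entropy for the single-label tasks and a sigmoid head trained with binary cross-entropy for multi-label slot classification (token-level heads for NER follow the same pattern per token). It is built on the Hugging Face transformers BertModel; the wrapper itself is our own illustrative code, not the original implementation.

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertClassifier(nn.Module):
    """BERT with a linear head over the [CLS] representation."""

    def __init__(self, num_classes, multi_label=False, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.proj = nn.Linear(self.bert.config.hidden_size, num_classes)  # W in R^{K x I}
        self.multi_label = multi_label

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h_cls = outputs.last_hidden_state[:, 0]  # representation of the [CLS] token
        logits = self.proj(h_cls)
        loss = None
        if labels is not None:
            if self.multi_label:
                # Sigmoid + mean binary cross-entropy (MultiWOZ 2.1 slot classification).
                loss = F.binary_cross_entropy_with_logits(logits, labels.float())
            else:
                # Softmax + cross-entropy (SST-2, QNLI, MNLI, QQP).
                loss = F.cross_entropy(logits, labels)
        return logits, loss
```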

Main results
In this section, we simulate data-scarcity scenarios for the datasets in Table 1 with both BERT-base and BERT-large. Specifically, for each dataset we randomly sample 1% of the training data as the labeled corpus and leave the remaining 99% as unlabeled data. Both the labeled and unlabeled corpora are used as the input of TAPT, while only the unlabeled corpus is used for self-training. For all datasets, we randomly choose three data splits and perform three different runs for each of them, except for BERT-large on CoNLL 2003 and MultiWOZ 2.1, where we combat instability by leveraging results on the development sets: for these two datasets, we perform ten runs for each data split with BERT-large and report the corresponding test set results based on the top three runs on the development sets. Results are summarized in Table 2.
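A labeled/unlabeled split of this kind might be constructed along the following lines; this is a sketch under our own assumptions about the data format (examples as dicts with a "text" field), and the seeds and field names are illustrative only:

```python
import random

def make_semi_supervised_split(examples, labeled_ratio=0.01, seed=0):
    """Randomly reserve `labeled_ratio` of the training set as labeled data
    and treat the remaining examples as the unlabeled corpus (labels dropped)."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_labeled = max(1, int(labeled_ratio * len(examples)))
    labeled = [examples[i] for i in indices[:n_labeled]]
    unlabeled = [examples[i]["text"] for i in indices[n_labeled:]]  # keep inputs only
    return labeled, unlabeled

# Example: three random splits with 1% labeled data, as in the experiments.
# splits = [make_semi_supervised_split(train_set, 0.01, seed=s) for s in range(3)]
```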

Varying size of labeled data
We have demonstrated the effectiveness of TFS with both BERT-base and BERT-large on six different datasets with 1% of the training data in Section 4.2. We further explore different sizes of labeled data on the six datasets in Table 1 with the BERT-base model. Specifically, on the relatively simple SST-2 dataset, we vary the labeled data ratio over {0.1%, 0.2%, 0.5%, 1.0%}, and on the more difficult QNLI, MNLI and QQP datasets, we vary the labeled ratio over {1%, 2%, 5%, 10%}. For CoNLL 2003, we explore labeled data ratios in {0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1.0%}, and for MultiWOZ 2.1, we set the labeled data ratio to {1%, 2%, 3%, 4%, 5%}. Following the previous settings, for each labeled data ratio on these six datasets, we randomly select 3 data splits and perform three different runs for each split. The final average results for each data ratio are reported over these nine runs. To better measure the additive property of TFS, we also report TAPT+ST for reference, which directly adds the performance gains of TAPT and ST over the finetuning baseline (FT).
The results on the six datasets with different sizes of labeled data are summarized in Figure 2. TAPT outperforms ST on SST-2, QNLI, QQP and MultiWOZ 2.1 in various labeled data setups but underperforms it on MNLI and CoNLL 2003, indicating that TAPT and ST learn different representations from unlabeled data and have their own pros and cons. However, TFS consistently and significantly outperforms both TAPT and ST in all scenarios across the six datasets, again proving its effectiveness over TAPT and ST alone. For example, on CoNLL 2003 with 0.5% labeled data, TFS yields relative improvements of 4.4% and 3.1% over TAPT and ST, respectively. More importantly, TFS overall achieves very similar results to TAPT+ST across various labeled data sizes on the different datasets, which further confirms that the TFS protocol yields strongly additive gains over TAPT and ST.

Analysis
Given the promising results in the previous experiments, we aim to answer why TFS consistently and significantly outperforms ST. The differences between TFS and ST lie in two aspects: (1) TFS is initialized from the TAPT checkpoint rather than from the original BERT, as ST is.
(2) TFS utilizes pseudo labels generated from TAPT-finetuned models, while ST uses pseudo labels generated from BERT-finetuned models (FT). To investigate these two aspects, we design a variant of the original ST, ST with TAPT Initialization (STTI), which utilizes pseudo labels generated by BERT-finetuned models, as ST does, but is initialized with the same TAPT checkpoints as TFS during the first round of self-training. This intermediate variant helps us better understand what makes TFS work. We run experiments on SST-2 with 0.1% and 1.0% labeled data with BERT-base to compare ST, STTI and TFS. The results of STTI are obtained by running over the same three data splits as ST and TFS, with three different runs for each data split. Results are averaged and summarized in Table 3.

Table 3: Comparison of ST, STTI and TFS. The first two rows show the initialization and the pseudo labeler of the different models, respectively; the last two rows list accuracy with 0.1% and 1.0% labeled training data. * represents models without finetuning on labeled data.

Importance of initialization. Table 3 shows that STTI consistently outperforms ST in both the 0.1% and 1% labeled setups. Comparing it with ST, we can conclude that its improvement over ST comes from the TAPT initialization. The results of MNLI and CoNLL 2003 in Figure 2 (c) and (e) also validate the importance of initialization. On these two datasets, although ST can consistently generate more accurate labels than TAPT-finetuned models, meaning that it can match TAPT-finetuned performance during the self-training process, it still underperforms TFS in the end. These results again indicate the importance of initialization: without TAPT as initialization, even if ST itself can outperform TAPT-finetuned models, which are the teachers of TFS in the self-training process, it will still fall behind TFS in the end.

Importance of pseudo label correctness. Table 3 also shows that STTI underperforms TFS in both the 0.1% and 1% labeled setups, although both inherit the same parameters from TAPT. These results indicate that, beyond initialization, accurate pseudo labels also matter for the self-training process. STTI takes pseudo labels generated from BERT-finetuned baselines (FT), which contain more errors, while TFS utilizes more accurate pseudo labels generated from TAPT-finetuned models. Suffering from more incorrect pseudo labels at the beginning of the self-training process, STTI may converge to a worse local optimum than TFS. This is even more severe when only 0.1% labeled data is available and FT lags far behind TAPT-finetuned models, causing STTI to show a 10.3% accuracy gap compared to TFS. These results demonstrate the importance of accurate pseudo labels for self-training. Combining these findings, we argue that TFS outperforms ST for at least two reasons: (1) it has a better initialization from TAPT, compared to ST's initialization from BERT; (2) it utilizes more accurate pseudo labels from TAPT-finetuned models than ST.

Conclusion
In this paper, we demonstrate that TAPT and ST are complementary for NLU tasks through TFS, which follows the TAPT → Finetuning → Self-training process. Our extensive experiments in various semi-supervised setups across six popular datasets show that they are not only complementary but also strongly additive under the TFS protocol. We further show that TFS outperforms ST through (1) a better initialization from TAPT and (2) more accurate predictions from TAPT-finetuned models. We hope that TFS can serve as an important semi-supervised baseline for future NLP studies.