Revisiting Pretraining with Adapters

Pretrained language models have served as the backbone for many state-of-the-art NLP results. These models are large and expensive to train. Recent work suggests that continued pretraining on task-specific data is worth the effort, as it leads to improved performance on downstream tasks. We explore alternatives to full-scale task-specific pretraining of language models through the use of adapter modules, a parameter-efficient approach to transfer learning. We find that adapter-based pretraining achieves results comparable to task-specific pretraining while using a fraction of the overall trainable parameters. We further explore the direct use of adapters without pretraining and find that direct fine-tuning performs mostly on par with pretrained adapter models, contradicting previously proposed benefits of continual pretraining in full pretraining and fine-tuning strategies. Lastly, we perform an ablation study on task-adaptive pretraining to investigate how different hyperparameter settings can change the effectiveness of the pretraining.


Introduction
Pretrained Language Models (PLMs) are predominant in tackling current Natural Language Processing (NLP) tasks. Most PLMs based on the Transformer architecture (Vaswani et al., 2017) are first trained on massive text corpora with a self-supervised objective to learn word representations (Devlin et al., 2019; Liu et al., 2019), and are then fine-tuned for a specific target task. This pretraining and fine-tuning of PLMs achieves state-of-the-art (SOTA) performance in many NLP tasks. Inspired by the benefits of pretraining, several studies have demonstrated the effects of continued pretraining on the domain of a target task or on the target task dataset (Mitra et al., 2020; Han and Eisenstein, 2019; Gururangan et al., 2020). Gururangan et al., 2020 adapt PLMs to the target task by further pretraining RoBERTa (Liu et al., 2019) on the target text corpus before fine-tuning it for the corresponding task, and show that this task adaptation consistently improves performance on text classification tasks.
However, this full process of pretraining and then fine-tuning can be parameter inefficient for recent PLMs that have millions or billions of parameters (Devlin et al., 2019; Radford et al., 2018). This parameter inefficiency becomes even worse when one continues pretraining all the parameters of PLMs on the task-specific corpus. Furthermore, recent PLMs require hundreds of MB to store all their weights (Liu et al., 2019; Radford et al., 2018), making it difficult to download and share the pretrained models on the fly.
Recently, adapters have been proposed as an alternative approach that decreases the substantial number of parameters of PLMs that must be trained in the fine-tuning stage (Houlsby et al., 2019). Fine-tuning with adapters mostly matches the performance of full fine-tuning on many NLP tasks, including the GLUE benchmark (Wang et al., 2018), and reduces the size of the model from hundreds of MB to the order of MB (Pfeiffer et al., 2020b). A natural question therefore arises from the success of the adapter approach: can the adapter alone adapt PLMs to the target task when it is used in the second phase of the pretraining stage, and thus improve performance on the corresponding task?
In this paper, we explore task-adaptive pretraining, termed TAPT (Gururangan et al., 2020), with adapters to address this question and overcome the limitations of conventional full pretraining and fine-tuning. We train only the adapter modules in the second phase of pretraining as well as in the fine-tuning stage to achieve both parameter efficiency and the benefits of continual pretraining, and we compare this model with the adapter-based model without pretraining. Surprisingly, we find that directly fine-tuning adapters performs mostly on par with the pretrained adapter model and outperforms the full TAPT, contradicting the previously proposed benefits of continual pretraining in the full pretraining and fine-tuning scheme. Because directly fine-tuning adapters skips the second phase of pretraining and the training steps of adapters are faster than those of the full model, it substantially reduces the training time. We further investigate different hyperparameter settings that affect the effectiveness of pretraining.

Pretraining and Adapters
Pre-trained language model We use RoBERTa (Liu et al., 2019), a Transformer-based language model that is pre-trained on a massive text corpus, following Gururangan et al., 2020. RoBERTa is an extension of BERT (Devlin et al., 2019) with optimized hyperparameters and a modification of the pretraining objective, which excludes next sentence prediction and only uses the randomly masked tokens in the input sentence. To evaluate the performance of RoBERTa on a certain task, a classification layer is appended on top of the language model after the pretraining and all the parameters in RoBERTa are trained in a supervised way using the label of the dataset. In this paper, training word representations using RoBERTa on a masked language modeling task will be referred to as pretraining. Further, taking this pretrained model and adding a classification layer with additional updates to the language model parameters will be referred to as fine-tuning.
Task-adaptive pretraining (TAPT) Although RoBERTa achieves strong performance by simply fine-tuning the PLM on a target task, there can be a distributional mismatch between the pretraining and target corpora. To address this issue, pretraining on the target task or on the domain of the target task can be used to adapt the language model to the target task, which further improves the performance of the PLM. Such methods are referred to as Domain-Adaptive Pretraining (DAPT) or Task-Adaptive Pretraining (TAPT) (Gururangan et al., 2020). In this paper, we limit the scope of our work to TAPT because a domain text corpus is not always available for each task, whereas TAPT can be applied easily by directly using the dataset of the target task, and its performance often matches that of DAPT (Gururangan et al., 2020). In TAPT, the second phase of pretraining is performed with RoBERTa using the unlabeled text corpus of the target task, and the model is then fine-tuned on the target task.
Figure 1: The adapter architecture in the Transformer layer (Pfeiffer et al., 2020a).
Adapter Adapter modules have been employed as a feature extractor in computer vision (Rebuffi et al., 2017) and have been recently adopted in the NLP literature as an alternative approach to fully fine-tuning PLMs. Adapters are sets of new weights that are typically embedded in each transformer layer of PLMs and consist of feed-forward layers with normalizations, residual connections, and projection layers. The architectures of adapters vary with respect to the different configuration settings. We use the configuration proposed by Pfeiffer et al., 2020a in Figure 1, which turned out to be effective on diverse NLP tasks, and add the adapter layer to each transformer layer.
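To make the structure of an adapter concrete, the following is a minimal pure-Python sketch of a bottleneck adapter: a down-projection, a nonlinearity, an up-projection, and a residual connection. It is an illustration under stated assumptions and not the configuration of Pfeiffer et al., 2020a. The class name `BottleneckAdapter`, the ReLU nonlinearity, the initialization scale, and the omission of bias terms and layer normalization are all simplifications introduced here.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def init_matrix(rows, cols, rng, scale=0.01):
    """Small random initialization; the scale is an assumption."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project, residual."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        self.bottleneck_size = bottleneck_size
        # The only new weights; the surrounding Transformer layer is untouched.
        self.down = init_matrix(bottleneck_size, hidden_size, rng)
        self.up = init_matrix(hidden_size, bottleneck_size, rng)

    def forward(self, hidden):
        # Project to the small bottleneck dimension and apply the nonlinearity.
        z = [max(0.0, v) for v in matvec(self.down, hidden)]
        # Project back to the hidden dimension.
        delta = matvec(self.up, z)
        # Residual connection: the adapter learns a correction to its input.
        return [h + d for h, d in zip(hidden, delta)]

    def num_parameters(self):
        # Two projection matrices (biases are omitted in this sketch).
        return 2 * self.hidden_size * self.bottleneck_size


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
output = adapter.forward([1.0] * 8)
print(len(output), adapter.num_parameters())
```

Because of the residual connection, an adapter whose up-projection is close to zero leaves the hidden state nearly unchanged, so inserting it does not disturb the pretrained model at the start of training.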
Pfeiffer et al., 2020c use two types of adapters for cross-lingual transfer: language-specific adapters and task-specific adapters. Both types have an architecture similar to that in Figure 1. However, the language adapters involve invertible adapters after the embedding layer to capture token-level language representations when they are trained via masked language modeling in the pretraining stage, whereas the task adapters are simply embedded in each transformer layer and trained in the fine-tuning stage to learn the task representation. Following Pfeiffer et al., 2020c, we employ language adapter modules with invertible adapter layers to pretrain adapters on the unlabeled target dataset. However, we fine-tune the pretrained parameters of the language adapter modules for evaluation to align with

Experiments
We now propose an adapter-based approach that is a parameter-efficient variant of Task-Adaptive Pretraining (TAPT), and we measure the performance margin between the pretrained adapter model and the adapter model without pretraining. To pretrain adapters, we added an adapter module to each transformer layer of RoBERTa using adapter-transformer (Pfeiffer et al., 2020b) and continued pretraining all the weights in the adapter layers on the target text corpus while keeping the original parameters of RoBERTa fixed. After the second phase of pretraining, we fine-tuned the model by training the weights in the adapters and the final classification layers while keeping all of the original parameters of RoBERTa frozen.
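The freezing scheme described above can be sketched as a simple selection over named parameter groups. This is a toy illustration only: the parameter names, the name-matching rule, and the parameter counts are hypothetical and do not reflect RoBERTa or the adapter-transformer library.

```python
def select_trainable(param_counts, trainable_markers):
    """Return the parameter groups whose name contains any trainable marker."""
    return {
        name: count
        for name, count in param_counts.items()
        if any(marker in name for marker in trainable_markers)
    }


def trainable_fraction(param_counts, trainable):
    """Fraction of all parameters that receive gradient updates."""
    return sum(trainable.values()) / sum(param_counts.values())


# Hypothetical model: names and sizes are made up for illustration.
toy_model = {
    "encoder.layer.0.attention": 1000,
    "encoder.layer.0.adapter.down": 40,
    "encoder.layer.0.adapter.up": 40,
    "classifier.dense": 20,
}

# Second phase of pretraining: only adapter weights are updated.
pretraining = select_trainable(toy_model, ("adapter",))
# Fine-tuning: adapter weights and the classification layer are updated.
fine_tuning = select_trainable(toy_model, ("adapter", "classifier"))

print(sorted(pretraining), trainable_fraction(toy_model, pretraining))
print(sorted(fine_tuning), trainable_fraction(toy_model, fine_tuning))
```

In both phases the original encoder weights stay fixed, so only the small adapter (and, for fine-tuning, classifier) groups need to be stored per task.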

Dataset
Following Gururangan et al., 2020, we consider 8 classification tasks from 4 different domains. The specification of each task is shown in Table 1. We cover news and review texts that are similar to the pretraining corpus of RoBERTa, as well as scientific domains in which the text corpora can have distributions that differ substantially from those of RoBERTa. Furthermore, the pretraining corpora of the target tasks include both large and small cases, to determine whether the adapter-based approach is applicable in both low- and high-resource settings.

Implementation Details
Our implementation is based on HuggingFace, since we found that AllenNLP (Gardner et al., 2018), which is used in Gururangan et al., 2020, is incompatible with adapter-transformer (Pfeiffer et al., 2020b). We follow the hyperparameter settings in Gururangan et al., 2020, and each model in the pretraining and fine-tuning stages is trained on a single GPU (NVIDIA RTX 3090). Details of the hyperparameters are described in Appendix A. Note that for the pretraining step, we use a batch size of 8 and accumulate gradients over 32 steps to be consistent with the hyperparameter settings in Gururangan et al., 2020. We perform pretraining with the self-supervised masked language modeling (MLM) objective, in which tokens are randomly masked with a probability of 15% in each epoch. We do not apply validation during pretraining, and we save the model at the end of training from a single seed. For TAPT, we train all the parameters of RoBERTa via MLM on the target dataset, whereas for the adapter-based model, we embed the language adapters in each transformer layer and add invertible adapters after the embedding layers to perform MLM while freezing the original parameters of RoBERTa, following Pfeiffer et al., 2020c. The fine-tuning step is straightforward: we fine-tune the parameters that were pretrained via MLM for both TAPT and the adapter model. Validation is performed after each epoch, and the best checkpoint is loaded at the end of training to evaluate performance on the test set.
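Two details of the pretraining setup lend themselves to a small worked example: re-sampling the masked positions in each epoch, and the effective batch size that results from gradient accumulation. The sketch below is a simplification and not the code used in the experiments. The function names and the `<mask>` symbol are assumptions, it operates on whitespace tokens rather than subwords, and every selected token is simply replaced by the mask symbol, whereas BERT-style implementations commonly also leave some selected tokens unchanged or swap in random tokens.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, an assumption of this sketch


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask each token independently with probability mask_prob.

    Returns the corrupted sequence and the labels: the original token at
    masked positions and None elsewhere (no loss is computed there).
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


def effective_batch_size(per_step_batch_size, accumulation_steps):
    """Number of sequences contributing to one optimizer update."""
    return per_step_batch_size * accumulation_steps


rng = random.Random(0)
tokens = "adapters add a small number of trainable weights".split()
for epoch in range(2):
    # Calling mask_tokens again draws a fresh mask for every epoch.
    masked, labels = mask_tokens(tokens, rng=rng)
    print(epoch, masked)

# A batch size of 8 accumulated over 32 steps gives 8 * 32 = 256.
print(effective_batch_size(8, 32))
```

Accumulating gradients in this way reproduces the update of a larger batch on a single GPU, at the cost of more forward and backward passes per optimizer step.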

Experimental setup
Experiments cover four different models. First, we reproduce the performance of RoBERTa and TAPT in Gururangan et al., 2020, as presented in Appendix C. Then we proceed to the adapter-based approach.
Table 2: Average F1 score with standard deviation on the test set. Each score is averaged over 5 random seeds. The evaluation metric is macro-F1 on the test set for each task except for CHEMPROT and RCT, which use micro-F1. We report the results of baseline RoBERTa and TAPT from Gururangan et al., 2020. Following Rücklé et al., 2020, we measure the average relative speed for the training and inference time across all tasks, except for the inference speed in the fine-tuning stage, which excludes low-resource tasks. PT and FT indicate pretraining and fine-tuning, respectively.
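Since the evaluation metric differs across tasks, the distinction between macro-F1 and micro-F1 can be illustrated with a short sketch. It is not the evaluation code used in the experiments, and the function name `f1_scores` and the toy labels are introduced here for illustration. Macro-F1 averages the per-class F1 scores so that every class counts equally, while micro-F1 pools the counts over all classes before computing a single score.

```python
def f1_scores(gold, pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class) / len(per_class)
    micro_denom = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / micro_denom if micro_denom else 0.0
    return macro, micro


# Toy example with an imbalanced label distribution (labels are made up).
gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
macro, micro = f1_scores(gold, pred)
print(round(macro, 4), round(micro, 4))  # 0.7333 0.75
```

In this toy case the minority class "b" has a lower F1 (2/3) than class "a" (0.8), which pulls the macro average below the micro average.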
To investigate the benefits of task-adaptive pretraining with adapters, we compare the performance of the pretrained adapter model, which involves the second phase of pretraining, with the model without pretraining, i.e., directly fine-tuning adapters in RoBERTa on the target task. Since the weights of the adapters are randomly initialized, we empirically found that a larger learning rate worked well compared to the full fine-tuning experiments. We sweep the learning rates in {2e-5, 1e-4, 3e-4, 6e-4} and the number of epochs in {10, 20}, select the configuration that performs best on the validation set, and report its test score.
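The selection procedure for the sweep can be sketched as a grid search that chooses a configuration by validation score and reports the test score of that configuration only. This is an illustrative sketch: `select_by_validation` and `toy_run` are hypothetical names, and the scores produced by `toy_run` are made-up numbers standing in for an actual training run, not experimental results.

```python
from itertools import product

LEARNING_RATES = (2e-5, 1e-4, 3e-4, 6e-4)
EPOCHS = (10, 20)


def select_by_validation(run_fn, learning_rates=LEARNING_RATES, epochs=EPOCHS):
    """Try every (learning rate, epochs) pair and keep the best on validation.

    run_fn(lr, n_epochs) must return (validation_score, test_score).
    """
    best = None
    for lr, n_epochs in product(learning_rates, epochs):
        val_score, test_score = run_fn(lr, n_epochs)
        # The test score is recorded but never used for selection.
        if best is None or val_score > best["val"]:
            best = {"lr": lr, "epochs": n_epochs,
                    "val": val_score, "test": test_score}
    return best


def toy_run(lr, n_epochs):
    """Stand-in for fine-tuning; returns synthetic scores for illustration."""
    val_score = 0.9 - abs(lr - 1e-4) * 100 + 0.0005 * n_epochs
    test_score = val_score - 0.01
    return val_score, test_score


best = select_by_validation(toy_run)
print(best["lr"], best["epochs"])
```

Selecting on the validation set and reporting the corresponding test score keeps the test set out of the model selection loop.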

Results
The results are summarized in Table 2. Surprisingly, in terms of average F1 score, the adapter-based model without task-adaptive pretraining performs best, followed by the adapter model with pretraining, TAPT, and the baseline RoBERTa. Except for Hyperpartisan news, the adapter model without pretraining performs mostly on par with the counterpart adapter model that involves pretraining on the target text corpus, suggesting that the benefits of additional task-adaptive pretraining diminish when we use the adapter-based approach. Furthermore, the directly fine-tuned adapter model trains only 1.42% of the entire parameters, which leads to a 30% faster training step than the full model, and it skips the pretraining stage, which is typically more expensive than fine-tuning. This substantially reduces the training time, while the relative inference speed decreases by only 2% compared with the full model.
Figure 2: F1 score as a function of learning rate on the test set, with log scale on the x-axis. The F1 score is averaged over 5 random seeds for low-resource tasks (CHEMPROT, ACL-ARC, SCIERC, HYPER) due to the high variance. For high-resource tasks (RCT, AGNEWS, HELPFULNESS, IMDB), we report the F1 score from a single random seed for each task. For RoBERTa and TAPT, we follow the hyperparameter settings in Gururangan et al., 2020 except for the learning rate.

Analysis
We analyze how the adapter alone can surpass or perform on par with both the full model and the adapter model with task-adaptive pretraining. Since, when fine-tuning adapters, we sweep the learning rates and the number of epochs over ranges that include larger values than those used for the full model, while keeping the other hyperparameters the same as in Gururangan et al., 2020, we hypothesize that the larger learning rate zeroes out the benefits of pretraining. Figure 2 shows the average F1 score across all tasks as a function of learning rate. The adapter model without a second phase of pretraining consistently outperforms or performs on par with the adapter model with pretraining from 1e-4 to 6e-4, indicating that the additional pretraining is ineffective in this range. In contrast, TAPT outperforms baseline RoBERTa at 2e-5, where both TAPT and baseline RoBERTa perform best. The results show that the learning rate used in the fine-tuning stage can affect the effectiveness of pretraining, and that directly fine-tuning a fraction of the parameters can provide performance comparable to the full model as well as to the adapter model with pretraining while substantially reducing the training time.
Table 3: Best performance of baseline RoBERTa and TAPT (Gururangan et al., 2020) on our implementation. Each score is averaged over 5 random seeds. The best configuration settings for each task are described in Appendix Table 8.
Inspired by the results of the adapter models, we perform the same experiments for the full model (baseline RoBERTa and TAPT) on our implementation by sweeping the learning rates and the number of epochs. We hypothesize that proper hyperparameter settings, such as a larger learning rate or a larger number of training steps in the fine-tuning stage, can improve the performance of baseline RoBERTa, making pretraining on the unlabeled target task less effective. We sweep the learning rates in {1e-5, 2e-5, 3e-5} and the number of epochs in {10, 20}, select the configuration that performs best on the validation set, and report its test score. Table 3 shows the best performance of the full models for each task among the different hyperparameter settings. The average F1 score of baseline RoBERTa greatly increases and, surprisingly, it surpasses the performance of TAPT in some tasks. The results suggest that although pretraining PLMs on the target task results in better performance, one can achieve comparable performance by simply using a larger learning rate or increasing the number of training steps in the fine-tuning stage, while skipping the pretraining step, which is computationally demanding compared to fine-tuning.

Conclusion
Our work demonstrates that adapters provide a competitive alternative to large-scale task-adaptive pretraining for NLP classification tasks. We show that pretraining with adapters can achieve performance similar to TAPT while training just 1.32% of the parameters. However, the most computationally efficient option is to skip pretraining and only perform fine-tuning with adapters. We found that skipping pretraining altogether and just fine-tuning with adapters outperforms or performs mostly on par with TAPT and the adapter model with pretraining across our tasks while substantially reducing the training time.

A Hyperparameter Details
Details of the hyperparameter settings, including the learning rates for the best performing results, are provided in Tables 4, 5, and 6.

B Validation Results
We present validation performance in Table 7 and Figure 3 and 8.

C Replication results
We provide replication results of Gururangan et al., 2020 in Table 9.
Table 7: Validation performance of adapter experiments. Each score is averaged over 5 random seeds. The evaluation metric is macro-F1 for each task except for CHEMPROT and RCT, which use micro-F1.
Figure 3: F1 score as a function of learning rate on the development set, with log scale on the x-axis. The F1 score is averaged over 5 random seeds for low-resource tasks (CHEMPROT, ACL-ARC, SCIERC, HYPER) due to the high variance. For high-resource tasks (RCT, AGNEWS, HELPFULNESS, IMDB), we report the F1 score from a single random seed for each task. Here we sweep the learning rates in {1e-4, 3e-4, 6e-4}, the number of epochs in {10, 20}, and the patience factor in {3, 5}.
Table 3. Each score is averaged over 5 random seeds.