On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation

Adapter-based tuning has recently arisen as an alternative to fine-tuning. It works by adding light-weight adapter modules to a pretrained language model (PrLM) and only updating the parameters of adapter modules when learning on a downstream task. As such, it adds only a few trainable parameters per new task, allowing a high degree of parameter sharing. Prior studies have shown that adapter-based tuning often achieves comparable results to fine-tuning. However, existing work only focuses on the parameter-efficient aspect of adapter-based tuning while lacking further investigation on its effectiveness. In this paper, we study the latter. We first show that adapter-based tuning better mitigates forgetting issues than fine-tuning since it yields representations with less deviation from those generated by the initial PrLM. We then empirically compare the two tuning methods on several downstream NLP tasks and settings. We demonstrate that 1) adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks; 2) it is more robust to overfitting and less sensitive to changes in learning rates.


Introduction
Large-scale pretrained language models (PrLMs) (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020a; Brown et al., 2020) have achieved state-of-the-art results on most natural language processing (NLP) tasks, and fine-tuning has become the dominant approach to utilizing PrLMs. A standard fine-tuning process copies weights from a PrLM and tunes them on a downstream task, which requires a new set of weights for each task.
Adapter-based tuning (Houlsby et al., 2019; Bapna and Firat, 2019) has been proposed as a more parameter-efficient alternative. For NLP, adapters are usually light-weight modules inserted between transformer layers (Vaswani et al., 2017). During model tuning on a downstream task, only the parameters of the adapters are updated while the weights of the original PrLM are frozen. Hence, adapter-based tuning adds only a small number of parameters for each task, allowing a high degree of parameter sharing. Despite using far fewer trainable parameters, adapter-based tuning has demonstrated performance comparable to full PrLM fine-tuning (Houlsby et al., 2019; Bapna and Firat, 2019; Stickland and Murray, 2019).
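The freezing scheme above can be sketched as a simple filter over parameter names. This is a minimal illustration with hypothetical checkpoint-style names (the exact naming convention depends on the implementation); the trainable set here follows the configuration used later in this paper, where adapters, layer norms, and the classification head are updated:

```python
def is_trainable(param_name: str) -> bool:
    """During adapter-based tuning, only adapter modules (plus layer norms
    and the task-specific classification head) receive gradient updates;
    every original PrLM weight stays frozen."""
    trainable_keywords = ("adapter", "layernorm", "layer_norm", "classifier")
    name = param_name.lower()
    return any(k in name for k in trainable_keywords)

# Hypothetical parameter names in the style of a BERT checkpoint.
params = [
    "encoder.layer.0.attention.self.query.weight",
    "encoder.layer.0.attention.output.adapter.down_proj.weight",
    "encoder.layer.0.output.adapter.up_proj.weight",
    "encoder.layer.0.output.LayerNorm.weight",
    "classifier.weight",
]
frozen = [p for p in params if not is_trainable(p)]
```

In a PyTorch-style implementation, the same predicate would be used to set `requires_grad = False` on the frozen parameters.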
Existing work mostly focuses on the parameter-efficient aspect of adapters and attempts to derive useful applications from it, which is still the case in most recent work: Rücklé et al. (2020) explore methods to further improve the parameter and computation efficiency of adapters; Pfeiffer et al. (2020a) combine knowledge from multiple adapters to improve performance on downstream tasks; Artetxe et al. (2020) and Pfeiffer et al. (2020c) leverage the modular architecture of adapters for parameter-efficient transfer to new languages or tasks; and Wang et al. (2020) utilize the same property for knowledge injection.
Besides parameter efficiency, the unique characteristic of adapter-based tuning, with its alternating frozen and learnable layers, might be directly useful for improving model performance. However, this has not yet been discussed in prior work. In this paper, we first empirically demonstrate that adapter-based tuning better regularizes training than fine-tuning by mitigating the issue of forgetting: it yields representations with less deviation from those generated by the original PrLM. Next, to see how this property of adapters helps when adapting PrLMs, we compare the performance of fine-tuning and adapter-based tuning on a wide range of datasets

Figure 1: The structure of the adapter, adopted from Houlsby et al. (2019). N is the number of transformer layers.

and NLP tasks. Extensive experiments and analysis are conducted in different settings, including low-resource and high-resource, monolingual and cross-lingual. Our main findings can be summarized as follows:

• For monolingual adaptation, adapter-based tuning yields better results in low-resource settings, especially when the task is more domain-specific. With increasing training samples, the performance gain over fine-tuning is less significant (§3).
• Adapter-based tuning tends to outperform fine-tuning on zero-shot cross-lingual tasks under different amounts of training data (§4).
• Adapter-based tuning demonstrates higher stability and better generalization ability. It is less sensitive to learning rates compared to fine-tuning (§5).

Adapter-based Tuning
When adapting a pretrained language model (PrLM), adapter-based tuning inserts light-weight neural networks (adapters) between the transformer layers of the PrLM, and only updates the parameters of the adapters on a downstream task while keeping those of the PrLM frozen. Unlike fine-tuning, which introduces an entirely new model for every task, one great advantage of adapter-based tuning is that it produces a compact model with only a few trainable parameters added per task.

Figure 2: Comparison of the representations obtained at each layer before (Base) and after adapter-based tuning or fine-tuning on BERT-base using Representational Similarity Analysis (RSA). 5000 tokens are randomly sampled from the dev set for computing RSA. A higher score indicates that the representation spaces before and after tuning are more similar.

Houlsby et al. (2019) have extensively studied the choices of adapter architectures and where they should be inserted into PrLMs. They find that a stack of down- and up-scale neural networks works well while introducing only a small number of extra parameters to the network. This design inspires most of the follow-up work (Pfeiffer et al., 2020a,c; Bapna and Firat, 2019). As shown in Figure 1, the adapter maps an input hidden vector h from dimension d to dimension m, where m < d, and then re-maps it back to dimension d. We refer to m as the hidden size of the adapter. A skip-connection is employed inside the adapter network such that if the parameters of the projection layers are near zero, the adapter module approximates an identity function. Formally, given the input hidden vector h, the output vector h' is calculated as:

h' = f2(f1(h)) + h

in which f1(·) and f2(·) are the down- and up-projection layers respectively. At each transformer layer, two adapters are inserted right after the self-attention and the feed-forward layers respectively. During adapter tuning, only the parameters of the adapters, the normalization layers, and the final classification layer are updated. We use the adapter configuration described above in all of our experiments, since it is adopted in most prior work with few modifications.
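The bottleneck adapter above can be sketched in a few lines of NumPy. The ReLU nonlinearity between the two projections is an assumption for illustration (the text only names f1 and f2); note how near-zero projection weights make the module collapse to the identity via the skip-connection:

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: h' = f2(f1(h)) + h, with a down-projection
    f1 (d -> m), an assumed ReLU, an up-projection f2 (m -> d),
    and a skip-connection from input to output."""
    z = np.maximum(0.0, h @ W_down + b_down)   # f1: d -> m (with ReLU)
    return z @ W_up + b_up + h                 # f2: m -> d, plus skip

d, m = 8, 2                                    # toy sizes; BERT-base uses d = 768
rng = np.random.default_rng(0)
h = rng.normal(size=(4, d))

# With zero projection weights, the adapter is exactly the identity.
out = adapter(h, np.zeros((d, m)), np.zeros(m), np.zeros((m, d)), np.zeros(d))
```

This identity property is what lets a freshly inserted adapter leave the pretrained representations unchanged at initialization.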

Representation Similarity
Fine-tuning large-scale PrLMs on downstream tasks can suffer from overfitting and poor generalization (Dodge et al., 2020; Phang et al., 2018). Recently, Lee et al. (2020) proposed Mixout to regularize the fine-tuning of PrLMs. They show that Mixout avoids catastrophic forgetting and stabilizes fine-tuning by encouraging the weights of the updated model to stay close to the initial weights. Since adapter-based tuning does not update the weights of the PrLM at all, we suspect that it has a similar effect of alleviating catastrophic forgetting. To verify this, we use Representational Similarity Analysis (RSA) (Laakso and Cottrell, 2000) to assess the similarity of the tuned representations to the untuned ones at each transformer layer. RSA has been widely used to analyze the similarity between the outputs of two neural networks (Abnar et al., 2019; Chrupała and Alishahi, 2019; Merchant et al., 2020). It works by creating two comparable sets of representations by feeding the same set of n samples to the two models. For each set of representations, an n × n pairwise similarity matrix is calculated. The final RSA score between the two representation spaces is computed as the Pearson correlation between the flattened upper triangles of the two similarity matrices. We use a subset of GLUE tasks (Wang et al., 2018) for our analysis. Given a task, we first perform adapter-based tuning and fine-tuning to adapt a BERT-base model (M_org) to the target task, which yields models M_adapt and M_ft respectively (see Appendix A.2 for training details). Then we pass sentences (or sentence pairs, depending on the task) from the development set to M_org, M_adapt, and M_ft respectively. We extract representations at each layer from the three models and select the corresponding representations of 5k randomly sampled tokens (n = 5000) for evaluation.
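The RSA procedure just described can be sketched as follows. The choice of cosine similarity for the pairwise matrix is an assumption here (the paper does not specify the similarity function in this excerpt); the Pearson correlation over flattened upper triangles follows the description above:

```python
import numpy as np

def rsa_score(X, Y):
    """RSA between two sets of n representations X (n x d1) and Y (n x d2):
    build n x n pairwise cosine-similarity matrices, then take the Pearson
    correlation of their flattened upper triangles (diagonal excluded)."""
    def sim_matrix(Z):
        Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        return Zn @ Zn.T
    iu = np.triu_indices(X.shape[0], k=1)
    a, b = sim_matrix(X)[iu], sim_matrix(Y)[iu]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))          # 50 token representations, dim 16
score_same = rsa_score(X, X.copy())    # identical spaces -> score of 1
```

Because the comparison operates on pairwise similarity structure rather than raw coordinates, RSA can compare representation spaces of different dimensionality, and the score is invariant to uniform rescaling of either space.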
Note that the same set of tokens is used for all models. Finally, we compare the representations obtained from M_adapt or M_ft to those from M_org using RSA. Figure 2 plots the results on SST-2; results on the other tasks show a similar trend and can be found in Appendix A.3. For both fine-tuning and adapter-based tuning, we observe that the representation change mostly arises in the top layers of the network, which is consistent with previous findings that higher layers are more task-relevant (Howard and Ruder, 2018). It can be clearly observed that, compared to fine-tuning, adapter-based tuning yields representations with less deviation from those of BERT-base at each layer, which supports our claim that adapter-based tuning better regularizes the tuning process by mitigating the forgetting problem. This property follows from the fact that adapter-based tuning freezes all the parameters of the PrLM. Moreover, because of the skip-connection inside the adapter, the hidden representation output by the adapter can mimic its input; in this way, some of the original knowledge of the PrLM (before injecting adapters) is preserved.
Since we find that adapter-based tuning better regularizes the learning process, the next question is how this property helps to improve performance when adapting PrLMs to downstream tasks. We conduct extensive experiments to investigate this. The remainder of this paper is organized as follows. We compare fine-tuning and adapter-based tuning on monolingual text-level adaptation tasks in §3, followed by cross-lingual adaptation in §4. Further analysis of training stability and generalization capability is presented in §5.

Monolingual Adaptation
In this section, we first experiment with the eight datasets used in Gururangan et al. (2020), including both high- and low-resource tasks (§3.1). We refer to this set of tasks as Task Adaptation Evaluation (TAE). We observe that adapter-based tuning consistently outperforms fine-tuning on low-resource tasks, while the two perform similarly on high-resource tasks. We further confirm the effectiveness of adapters in low-resource settings on the GLUE benchmark (Wang et al., 2018) (§3.2).

TAE
TAE consists of four domains (biomedical, computer science, news text, and AMAZON reviews) and eight classification tasks (two per domain); this domain diversity makes it suitable for assessing the adaptation effectiveness of different approaches. Detailed data statistics are given in Appendix A.1. We consider tasks with fewer than 5k training examples as low-resource tasks and the others as high-resource tasks.

Experimental Setup We perform supervised fine-tuning on RoBERTa-base as our baseline (RoBa.-ft). For adapter-based tuning, we set the hidden size m of the adapters to 256 (RoBa.-adapter256). We also present results with task-adaptive pretraining added (+TAPT) (Gururangan et al., 2020). In this setting, before fine-tuning or adapter-based tuning, the model is trained with a masked language modeling (MLM) objective on the training texts (without labels) of the task. Note that in RoBa.-adapter256+TAPT, we also use adapter-based tuning for TAPT, i.e., only the weights of the adapters are updated at the TAPT stage. This evaluates whether adapter-based tuning can also work with unsupervised learning objectives. We follow the experimental settings in Gururangan et al. (2020) for TAPT. For both fine-tuning and adapter-based tuning, we train models for 20 epochs to make sure they are sufficiently trained and save a checkpoint after each training epoch. We select the checkpoint that achieves the best score on the validation set for evaluation on the test set. The batch size is set to 16 for both methods. The learning rate is set to 2e-5 for fine-tuning and 1e-4 for adapter-based tuning. See Appendix A.2 for the hyperparameter selection process and more training details.
Results Table 1 presents the comparison results.
We report the average result over 5 runs with different random seeds. On the four low-resource tasks, adapter-based tuning consistently outperforms fine-tuning and improves the average result by 1.9%. Adapter-based tuning alone, without TAPT, even outperforms fine-tuning with TAPT. Moreover, adding TAPT before adapter-based tuning further improves performance on 3 out of 4 low-resource tasks, which suggests that adapter-based tuning works with both supervised and unsupervised objectives. Another finding is that when trained on high-resource tasks, both methods achieve similar results. To verify the effect of training size, we plot performance on the high-resource tasks with varying numbers of training examples in Figure 3. The trend is consistent with our existing observations: adapter-based tuning achieves better results when the training set is small, while fine-tuning gradually catches up as the number of training examples increases.

GLUE Low-resource Adaptation
To further validate that adapters tend to generalize better than fine-tuning under low-resource settings, we follow Zhang et al. (2021) and study low-resource adaptation using eight datasets from the GLUE benchmark (Wang et al., 2018), covering four types of tasks: natural language inference (MNLI, QNLI, RTE), paraphrase detection (MRPC, QQP), sentiment classification (SST-2), and linguistic acceptability (CoLA). Appendix A.1 provides detailed data statistics and descriptions.
Experimental Setup For each dataset, we simulate two low-resource settings by randomly sampling 1k and 5k instances from the original training set as the new training sets. In each setting, we draw 1k samples from the remaining training data as the validation set and use the original validation set as the held-out test set, since the original GLUE test sets are not publicly available. We perform fine-tuning on BERT-base (BERT-ft) and RoBERTa-base (RoBa.-ft) respectively as our baselines. We set the learning rate to 2e-5 and the batch size to 16 for the BERT and RoBERTa fine-tuning experiments (see Appendix A.2 for details). For adapters, we only tune the hidden size in {64, 128, 256}, keeping the learning rate at 1e-4 and the batch size at 16, the same as used in §3.1.
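The low-resource simulation above can be sketched as a seeded split. This is an illustrative sketch of the sampling scheme, not the authors' exact code; the validation set is drawn from the remainder so it never overlaps the simulated training set:

```python
import random

def low_resource_split(train_set, n_train, n_val=1000, seed=0):
    """Simulate a low-resource setting: sample n_train training instances,
    then draw an n_val validation set from the remaining examples (the
    original dev set then serves as the held-out test set)."""
    rng = random.Random(seed)
    shuffled = list(train_set)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]

full = list(range(20000))                       # stand-in for a GLUE training set
train_1k, val = low_resource_split(full, n_train=1000)
```

Fixing the seed makes the split reproducible across the multiple runs used for averaging.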
Results Table 2 presents the comparison results. For adapter-based tuning, we report two results on each task: one obtained with the optimal hidden size, which varies per dataset, and the other obtained with a hidden size of 64. We observe that adapter-based tuning outperforms fine-tuning most of the time under both the 1k and 5k settings. In particular, the performance gain is more significant in the 1k setting, where, on average across all tasks, adapter-based tuning outperforms fine-tuning by 2.5% and 0.7% on BERT and RoBERTa respectively.

Discussions
One consistent observation from §3.1 and §3.2 is that adapters tend to outperform fine-tuning on text-level classification tasks when the training set is small, but with more training samples, the benefit of adapters is less significant. In low-resource settings, fine-tuning suffers more from overfitting, since it has many more tunable parameters than adapter-based tuning, so adapter-based tuning works better. In high-resource settings, however, overfitting is not a major issue and model capacity counts for more; the model capacity under fine-tuning is clearly larger than under adapter-based tuning, since fine-tuning can update many more model parameters.

(Footnote: users are limited to a maximum of two GLUE test-set submissions per day, which is inconvenient for a large number of runs.)
When comparing the improvements of adapter-based tuning over fine-tuning on tasks from TAE (§3.1) and GLUE (§3.2), we find that the improvement is more significant on the low-resource tasks from TAE: on RoBERTa-base, the average improvement brought by adapters is 1.9% across the four low-resource TAE tasks, while the average improvement on GLUE is 0.7% and 0.4% in the 1k and 5k settings respectively. As indicated in Gururangan et al. (2020), the TAE datasets are more domain-specific and overlap less with the corpus used for RoBERTa-base pretraining. One intuitive explanation is therefore that fine-tuning suffers more from forgetting and overfitting in domain adaptation, where the target domain is dissimilar to the source domain used in pretraining; thus adapter-based tuning is preferable in this scenario.

Note that UD-POS, Wikiann NER, and XNLI are all high-resource tasks, with 20k, 20k, and 400k training samples respectively. Unlike on the monolingual tasks, adapters achieve consistent performance gains on cross-lingual tasks even under high-resource settings. We suspect that the ability to mitigate forgetting is more useful in cross-lingual scenarios, since the model's knowledge of the target languages comes only from pretraining, and adapter-based tuning better maintains this knowledge. We further investigate the effectiveness of adapter-based tuning on XNLI with smaller training sets. Table 4 summarizes the results when training on 5%, 10%, and 20% of the original training set. In all settings, adapters still demonstrate consistent improvements over fine-tuning.

Analysis
Adapter Hidden Size The hidden size m is the only adapter-specific hyperparameter. As indicated in Houlsby et al. (2019), the hidden size provides a simple means to trade off performance against parameter efficiency. Table 5 shows the performance with different hidden sizes, from which we find that increasing the hidden size does not always lead to performance gains. For monolingual low-resource adaptation, the TAE tasks prefer a larger hidden size, while the results on GLUE are similar across different hidden sizes. We suspect this is because the TAE datasets are more dissimilar to the pretraining corpus, requiring relatively more trainable parameters to learn the domain-specific knowledge. On XNLI, a larger hidden size helps improve performance when the full data is used. However, when only 5% of the training data is used, increasing the hidden size does not yield consistent improvements. The results indicate that the optimal hidden size depends on both the domain and the training size of the task.
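The performance/parameter trade-off governed by m can be made concrete by counting the parameters the adapters add. The sketch below counts only the two projection layers of each adapter (two adapters per transformer layer, per §2); the paper's footnote reports roughly 2% of BERT-base's 110M parameters for m = 64, which this count matches, though exact fractions depend on what is included (e.g., biases, layer norms):

```python
def adapter_params_per_model(d=768, m=64, n_layers=12, adapters_per_layer=2):
    """Parameters added by bottleneck adapters: each adapter holds a
    down-projection (d*m weights + m biases) and an up-projection
    (m*d weights + d biases)."""
    per_adapter = (d * m + m) + (m * d + d)
    return per_adapter * adapters_per_layer * n_layers

bert_base_params = 110_000_000
frac = adapter_params_per_model(m=64) / bert_base_params   # roughly 2%
```

Since the count is essentially linear in m, doubling the hidden size roughly doubles the added parameters, which is why m directly controls the trade-off.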
Learning Rate Robustness We compare the two tuning methods in terms of their stability w.r.t. the learning rate. Figure 4 shows the performance distributions on CoLA and MNLI under the 1k and 5k settings. The learning rates are varied in {2e-5, 4e-5, 6e-5, 8e-5, 1e-4}. Each box in the plot is drawn from the results of 20 runs with different random seeds. We observe that fine-tuning yields larger variances as the learning rate increases; it often collapses with learning rates larger than 4e-5 when RoBERTa-base is used. Adapter-based tuning is stable across a wider range of learning rates.

(Footnote: the fraction of adapter parameters w.r.t. BERT-base (110M parameters) is 2%, 4%, and 6% when m is set to 64, 128, and 256; the fraction w.r.t. XLMR-large (550M parameters) is 1%, 2%, and 3%, respectively.)
Overfitting and Generalization Here, we first study the robustness of adapter-based tuning to overfitting. We use CoLA, MRPC, QNLI, and SST-2 with their original training and development sets for our analysis. CoLA and MRPC contain 8.5k and 3.7k training samples and are regarded as low-resource tasks; QNLI and SST-2 contain 104k and 67k training samples and are used as high-resource tasks. We train the two low-resource tasks for 10k steps and the high-resource tasks for 60k steps, with a batch size of 16. We use BERT-base for all experiments. Figure 5 plots the loss curves on the dev sets w.r.t. training steps. We observe that fine-tuned models can easily overfit on both low- and high-resource tasks, while adapter-based tuning is more robust to overfitting. Additional results on accuracy w.r.t. training steps and a similar analysis on XNLI are given in Appendix A.3.
We also present the mean and best dev results across all evaluation steps in Table 6, where we perform an evaluation step every 20 training steps. The mean results of adapter-based tuning consistently outperform those of fine-tuning. The differences between the mean and the best values are also smaller with adapter-based tuning. These results suggest that the performance of adapters is more stable than that of fine-tuning along the training process.
Training a neural network can be viewed as searching for a good minimum in the non-convex landscape defined by the loss function. Prior work (Hochreiter and Schmidhuber, 1997; Li et al., 2018) shows that the flatness of a local minimum correlates with generalization capability. We therefore also inspect the loss landscapes of the two tuning methods. Following Hao et al. (2019), we plot the loss curve by linear interpolation between θ0 and θ1 with the function f(α) = L(θ0 + α · (θ1 − θ0)), where θ0 and θ1 denote the model weights before and after tuning, L(θ) is the loss function, and α is a scalar parameter. In our experiments, we set the range of α to [−2, 2] and uniformly sample 20 points. Figure 6 shows the loss landscape curves on CoLA and SST-2 based on BERT-base. The minima of adapter-based tuning are wider and flatter, which indicates that adapter-based tuning tends to generalize better.
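The interpolation procedure above is straightforward to implement. The sketch below uses a toy quadratic loss in place of the model's actual loss; the interpolation logic itself follows f(α) = L(θ0 + α · (θ1 − θ0)) with 20 uniform samples over [−2, 2] as described:

```python
import numpy as np

def interpolated_losses(loss_fn, theta0, theta1, lo=-2.0, hi=2.0, n=20):
    """Linear-interpolation loss curve: evaluate
    f(alpha) = L(theta0 + alpha * (theta1 - theta0))
    at n points sampled uniformly over [lo, hi]."""
    alphas = np.linspace(lo, hi, n)
    losses = np.array([loss_fn(theta0 + a * (theta1 - theta0)) for a in alphas])
    return alphas, losses

# Toy stand-in: quadratic loss minimized at the tuned weights theta1 = 0.
theta0 = np.array([2.0, -1.0])     # "before tuning"
theta1 = np.array([0.0, 0.0])      # "after tuning"
loss = lambda th: float(np.sum(th ** 2))
alphas, losses = interpolated_losses(loss, theta0, theta1)
```

On this curve, α = 0 corresponds to the initial weights and α = 1 to the tuned ones; the width of the basin around the α ≈ 1 minimum is what the flatness comparison reads off.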
Comparison to Mixout The focus of this paper is to answer the question: besides being parameter-efficient, when is adapter-based tuning more effective than fine-tuning for PrLM adaptation? Thus, we only use fine-tuning as our primary baseline in the previous sections. Here, we further compare adapter-based tuning to fine-tuning regularized by Mixout (Lee et al., 2020) on a subset of GLUE tasks, since Mixout similarly regularizes learning by mitigating the forgetting issue. Specifically, it replaces all outgoing parameters of a randomly selected neuron with the corresponding parameters of the initial model, thereby reducing divergence from the initial model. Following the suggestions in the paper, we replace all dropout modules in the network with Mixout and set the mixout probability to 0.9. From the results in Table 7, we find that adapter-based tuning alone yields the best results in most cases. Applying Mixout to fine-tuning improves performance on CoLA and MRPC only, while applying it to adapters instead tends to degrade performance. We suspect this is because the number of trainable parameters in adapters is already very small; further replacing a large fraction of them with their initial weights may weaken the learning ability.
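The neuron-wise replacement described above can be sketched as follows. This is a simplified illustration of the core idea, assuming neurons correspond to weight-matrix columns; the published Mixout also rescales the mixed weights, which is omitted here:

```python
import numpy as np

def mixout(W_current, W_init, p=0.9, seed=0):
    """Simplified Mixout sketch: with probability p, a neuron's (column's)
    outgoing parameters are replaced by the corresponding pretrained
    values, keeping the updated model close to its initialization."""
    rng = np.random.default_rng(seed)
    replace = rng.random(W_current.shape[1]) < p    # one draw per neuron
    return np.where(replace[None, :], W_init, W_current)

# Distinguishable toy weights: pretrained = 0, current (tuned) = 1.
W_init = np.zeros((4, 1000))
W_cur = np.ones((4, 1000))
W_mix = mixout(W_cur, W_init, p=0.9)
```

With p = 0.9, about 90% of the neurons revert to their pretrained values on each application, which is why stacking Mixout on top of adapters, whose trainable parameter budget is already tiny, can starve learning.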

Related Work
Fine-tuning pretrained large-scale language models has proven effective on a wide range of NLP tasks (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020a; Brown et al., 2020). However, fine-tuning requires a new set of weights for each task, which is parameter-inefficient. Adapter-based tuning was proposed to address this problem (Houlsby et al., 2019). Most previous work has demonstrated that it achieves performance comparable to fine-tuning (Bapna and Firat, 2019; Pfeiffer et al., 2020b,a,c; Rücklé et al., 2020; Wang et al., 2020; Guo et al., 2020). However, existing work mostly focuses on the parameter-efficiency aspect while overlooking the effectiveness.
Fine-tuning PrLMs in low-resource settings has been studied for a while (Dodge et al., 2020; Lee et al., 2020; Phang et al., 2018; Jiang et al., 2020; Zhang et al., 2021). Previous work points out that with large-scale parameters, fine-tuning on a few samples can lead to overfitting and poor generalization, which makes the results unstable. Phang et al. (2018) find that pretraining on an intermediate task can improve fine-tuning outcomes. Jiang et al. (2020) improve the robustness of fine-tuning by controlling model complexity and preventing aggressive updating. On the other hand, catastrophic forgetting can appear when transferring pretrained neural networks (French, 1999; McCloskey and Cohen, 1989; Goodfellow et al., 2013): the knowledge learned during pretraining is lost when adapting to downstream tasks. This phenomenon often appears in NLP tasks (Mou et al., 2016; Arora et al., 2019). To alleviate this problem when adapting pretrained language models, Howard and Ruder (2018) gradually unfreeze the layers starting from the last layer, and Sun et al. (2019) find that assigning lower learning rates to the bottom layers improves performance. Lee et al. (2020) regularize learning by encouraging the weights of the updated model to stay close to the initial weights. Aghajanyan et al. (2021) regularize fine-tuning by introducing noise to the input, which is similar to the adversarial training for fine-tuning studied in Zhu et al. (2020). Mosbach et al. (2021) point out that the instability of fine-tuning lies in the optimizer and propose revising the Adam optimizer by replacing it with a de-biased version. Chen et al. (2020) propose a mechanism to recall knowledge from pretraining tasks.

Conclusion
Prior work often focuses on the parameter-efficient aspect while overlooking the effectiveness of adapter-based tuning. We empirically demonstrate that adapter-based tuning can better regularize the learning process. We conduct extensive experiments to verify its effectiveness and conclude that 1) it tends to outperform fine-tuning on both low-resource and cross-lingual tasks; and 2) it demonstrates higher stability under different learning rates compared to fine-tuning. We hope our study will inspire more future work on PrLM adaptation based on adapters and other methods that only tune part of the PrLM parameters.

A Appendix
A.1 Datasets

TAE Table 8 presents the data statistics of the TAE datasets used in §3.1.
GLUE Table 9 presents the statistics and descriptions of the GLUE tasks. In §3.2, to investigate effectiveness in low-resource scenarios, we simulate two low-resource settings by randomly sampling 1k and 5k examples respectively from each original training set as the new training sets. In each setting, we draw 1k samples from the remaining training set as our validation set and use the original validation set as the held-out test set, since the original GLUE test sets are not publicly available.
For the RSA analysis in §2 and the analysis of overfitting and generalization in §5, we use the original training and development sets for analysis purposes, as this better reveals the behaviors under both high- and low-resource settings.

A.2 Experimental Details
Implementation We use language model implementations from the HuggingFace Transformers library (Wolf et al., 2019); our adapter implementation is also based on it. Following standard practice (Devlin et al., 2019), we pass the final-layer [CLS] token representation to a task-specific feed-forward layer for prediction on downstream tasks. Each experiment was performed on a single V100 GPU. We use the Adam optimizer (Kingma and Ba, 2015) with a linear learning rate scheduler.
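The [CLS]-based prediction head described above amounts to selecting the first token's final-layer representation and applying one linear layer. A minimal NumPy sketch (shapes and names are illustrative, not the authors' code):

```python
import numpy as np

def classify_from_cls(hidden_states, W, b):
    """Task head: take the final-layer [CLS] representation (position 0
    of each sequence) and apply a task-specific feed-forward layer to
    produce per-class logits."""
    h_cls = hidden_states[:, 0, :]     # (batch, d)
    return h_cls @ W + b               # (batch, num_classes)

batch, seq_len, d, num_classes = 3, 16, 8, 2
rng = np.random.default_rng(0)
H = rng.normal(size=(batch, seq_len, d))        # stand-in encoder output
W = rng.normal(size=(d, num_classes))
b = np.zeros(num_classes)
logits = classify_from_cls(H, W, b)
```

Under adapter-based tuning, W and b belong to the small trainable set alongside the adapters and layer norms.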
Training Details on TAE and GLUE For both fine-tuning and adapter-based tuning, we train models for a fixed number of epochs and select the model with the best validation performance at the end of an epoch for evaluation.
For adapters, on TAE, we set the batch size to the same value used in fine-tuning, and tune learning rates in {2e-5, 5e-5, 1e-4, 2e-4} and the adapter's hidden size in {64, 128, 256} to select the best configuration across all tasks. On GLUE, we keep the learning rate and batch size the same as used on TAE, and tune the adapter's hidden size in {64, 128, 256} for each task. We use the same hyperparameter settings for all our analysis experiments with GLUE tasks as well. Table 10 presents the detailed hyperparameter settings for TAE and GLUE.
Training Details on XTREME Tasks For UD-POS, Wikiann NER, and XNLI, we use a batch size of 32 and tune learning rates in {1e-5, 2e-5, 3e-5, 4e-5, 5e-5} on each task. We tune the adapter's hidden size in {64, 128, 256} to select the best value across all tasks. We use the English training and development sets of each task for hyperparameter tuning. Table 11 presents the detailed settings.

A.3 Additional Results

RSA Figure 7 presents additional Representational Similarity Analysis (RSA) plots on three GLUE tasks, as mentioned in §2. We further conduct RSA to show the deviation of the representation space before and after tuning (with the English training set) on three distant languages (zh, ja, th) from the cross-lingual NER task. Figure 8 presents the results.

Accuracy w.r.t Training Steps Figure 9 shows the change of accuracy with increasing training steps on four GLUE tasks. The results again indicate that adapter-based tuning is more robust to overfitting.
Overfitting Analysis on XNLI We train XLMR-large with 10% of the original English training data of XNLI and plot the average loss and accuracy curves on the development sets across all target languages except English in Figure 10. The plots show trends similar to those on the GLUE tasks (Figure 5 and Figure 9): models with fine-tuning overfit easily, while adapter-based tuning is more robust to overfitting. Table 12 and Table 13 present the cross-lingual POS tagging and NER results on each language respectively.

Figure 7: Comparison of the representations obtained at each layer before (Base) and after adapter-based tuning or fine-tuning on BERT-base using Representational Similarity Analysis (RSA). The original training and dev sets of CoLA, MRPC, and MNLI are used for this analysis. 5000 tokens are randomly sampled from the dev set of each task for computing RSA. A higher score indicates that the representation spaces before and after tuning are more similar.

Figure 8: Comparison of the representations obtained at each of the top 12 layers (layers 13-24) before (Base) and after adapter-based tuning or fine-tuning on XLMR-large using Representational Similarity Analysis (RSA). We show results on 3 distant languages from the Wikiann NER task. 5000 tokens are randomly sampled from the dev set of each language for computing RSA. A higher score indicates that the representation spaces before and after tuning are more similar.

Table 12: Zero-shot cross-lingual POS tagging accuracy on the test set of each target language. Results with "†" are taken from Hu et al. (2020). Results with "*" are reproduced by us.

Table 13: Zero-shot cross-lingual NER F1 on the test set of each language. Results with "†" are taken from Hu et al. (2020). Results with "*" are reproduced by us.

Table 14: Zero-shot XNLI accuracy on the test set of each language when trained with full data. Results with "†" are taken from Hu et al. (2020). Results with "*" are reproduced by us.

Table 15: Zero-shot XNLI accuracy on the test set of each language when trained on 5%, 10%, and 20% of the training data respectively.