AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

Transformer-based pre-trained models with millions of parameters require large storage. Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters. In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed. AdapterBias adds a token-dependent shift to the hidden output of transformer layers to adapt to downstream tasks with only a vector and a linear layer. Extensive experiments are conducted to demonstrate the effectiveness of AdapterBias. The experiments show that our proposed method can dramatically reduce the trainable parameters compared to previous works, with only a minimal decrease in task performance relative to fully fine-tuned pre-trained models. We further find that AdapterBias automatically learns to assign more significant representation shifts to the tokens related to the task in consideration.


Introduction
While large pre-trained language models (PLMs) have reached state-of-the-art results on natural language processing (NLP) tasks, fine-tuning a PLM requires updating all parameters and storing a fully fine-tuned model for each downstream task. These requirements have led to difficulties in real-world applications. Moreover, fine-tuning PLMs on low-resource datasets is subject to instabilities.
To tackle these shortcomings, Adapters (Houlsby et al., 2019), a more parameter-efficient training strategy for the transformer architecture (Vaswani et al., 2017), have been proposed. Instead of fine-tuning the whole model, Adapters introduce extra tunable weights and freeze the original parameters of the PLM. Adapters demonstrate comparable performance with fully fine-tuning the entire model. Although Adapters solve the problem of the PLM's massive parameters, researchers are curious about how many more parameters are required to reach state-of-the-art performance on standard NLP tasks. The results in Houlsby et al. (2019) have shown that the performance on the GLUE benchmark (Wang et al., 2018) is almost the same when removing the Adapters in the lower layers, which indicates that not every adapter is useful. This raises the question of whether adapters can be made even more parameter-efficient.

Figure 1: Overview of the main concept of our work compared to BitFit (Ben Zaken et al., 2021). Left: BitFit tends to add the same representation shift to different tokens. Right: Our work applies different representation shifts to tokens considering their importance to the downstream task and their characteristics. The shifts of input words that are more task-related are more significant than those of other tokens. For example, in SST-2 (Socher et al., 2013), which is a semantic task, the representation shifts of semantic words such as "kind" and "worse" are larger than those of other words.
To develop practical and memory-efficient methods of utilizing PLMs, Diff pruning (Guo et al., 2020) enables parameter-efficient transfer learning that scales well with new tasks. The approach learns a task-specific "diff" vector that extends the original pre-trained parameters and encourages the sparsity of the vector through L0-norm regularization. Another approach is BitFit (Ben Zaken et al., 2021), which shows that with small-to-medium training data, fine-tuning only a subset of the bias terms of pre-trained BERT models (Devlin et al., 2018) is competitive with fine-tuning the entire model. The central concept of these approaches is to add task-specific shifts to each output representation of the PLM layers so as to adapt to different tasks. However, previous works (Ben Zaken et al., 2021; Guo et al., 2020) both add the same shift to the output representation regardless of which token is more relevant to the task. Considering that some specific tokens might be more critical to a particular task, the representation can better adapt to the downstream task under a limited parameter budget if these shifts are based on the input tokens.
Based on this concept, in this study, we add token-dependent biases to the shifts by proposing AdapterBias, which consists of a vector and a linear layer (Lα). The vector represents the task-specific shift, and Lα produces the weights for input tokens. Thus, with the vector and the weights, AdapterBias can add a token-dependent shift to the transformer layer. Since the concept of BitFit (Ben Zaken et al., 2021) is similar to AdapterBias in adding a shift to the representation, we illustrate the difference between BitFit and AdapterBias in Figure 1. BitFit assigns identical shifts to all the tokens, while AdapterBias adds more significant shifts to the representations that are related to the task.
With fewer trainable parameters required, AdapterBias achieves comparable performance on the GLUE benchmark with Houlsby et al. (2019); Pfeiffer et al. (2020a); Guo et al. (2020); Ben Zaken et al. (2021); Hu et al. (2021). We further decrease the parameters of AdapterBias in different ways, including partial weight-sharing in AdapterBias and adding L0-norm regularization. Finally, AdapterBias has better interpretability due to its simplicity. We use different tools, including word clouds and PCA (Jolliffe, 2002), to visualize what AdapterBias has learned, and we find that the proposed approach automatically learns to assign larger representation shifts to task-related tokens.

Related Work
For NLP tasks, adapters are introduced for the transformer architecture. A set of adapter parameters is added at each transformer layer, mostly with bottleneck architectures (Houlsby et al., 2019). By keeping the output dimension identical, they cause no change to the structure or parameters of the original model.
Adapters quickly gained popularity in NLP with various applications. For multi-task learning (Caruana, 1997; Zhang and Yang, 2017; Liu et al., 2019b), a projected self-attention layer was proposed by Stickland and Murray (2019), while Bapna et al. (2019) proposed an additional layer norm suitable for machine translation.
Besides the applications of adapters, researchers are also dedicated to improving their performance. Based on the architecture introduced by Houlsby et al. (2019), AdapterFusion (Pfeiffer et al., 2020a) leveraged knowledge from multiple tasks with a new two-stage learning algorithm. Despite the recent popularity of these methods, they still train a relatively large number of training parameters.
Recently, studies have started to focus on improving the parameter efficiency of adaptation to a new task (Yang et al., 2021). Diff-pruning (Guo et al., 2020) achieves parameter efficiency by adding a sparse, task-specific difference vector to the fixed original parameters. The vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Rücklé et al. (2020) introduced AdapterDrop, which has recently been integrated into AdapterHub (Pfeiffer et al., 2020b). It removes adapters from lower transformer layers during training and inference, which can dynamically reduce the computational cost. Mahabadi et al. (2021) proposed Compacter, which improves the trade-off between performance and trainable parameters per task through low-rank optimization.
On the other hand, without modifying the architecture of the PLM, BitFit (Ben Zaken et al., 2021) shows that fine-tuning only the bias terms of a large PLM is also competitive with fine-tuning the entire model. Fine-tuning only the bias terms can be considered as adding a task-specific shift to the token representations. BitFit is the work most similar to ours; while in BitFit the shifts added to the representations are exactly the same for all input tokens, in our work the shifts are token-dependent.
Problem Formulation

We consider the general problem of fine-tuning PLMs, where the training data D = {(x_i, y_i)}_{i=1}^{N} is given. Assume we are given a PLM with parameters θ and AdapterBias with parameters θ′. During the training stage, we freeze θ and tune only θ′.

AdapterBias
The architecture of AdapterBias is shown in the right part of Figure 2. AdapterBias consists of two modules: a vector (v) and a linear layer (Lα). v is a task-specific shift added to the output of each transformer layer. Tokens that are more related to the task should be assigned larger representation shifts than other tokens. The linear layer (Lα) produces a token-dependent weight vector α = [α1, α2, ..., αm]^T, where αi is the weight of the i-th token's representation shift. By applying the token-specific weight to the task-specific representation shift (v), AdapterBias can focus on the tokens that are more important to the task and is able to adapt to different downstream tasks efficiently.
We define the output of AdapterBias as the bias (B), which is the outer product of v and the learned weight vector α. When the dimension of the token's representation is r and there are m input tokens, the function can be defined as

B = v ⊗ α = v α^T, where v ∈ R^r, α ∈ R^m, and B ∈ R^(r×m).

To further elaborate on the details of AdapterBias, we give an example of how AdapterBias produces B and how B is added to the transformer layer. In Figure 3, we assume that there are three representation outputs (r1, r2, r3) after the first layer normalization. The dimension of r1, r2, and r3 is that of the second feed-forward layer, while the input dimension of the linear layer (Lα) is the output dimension of the first feed-forward layer, with the token representations (r1, r2, r3) as its inputs. The linear layer (Lα) produces α, where α ∈ R^3. The blocks in different colors represent the differences in the weights (α1, α2, α3). Taking BERT-base as an example, after performing the outer product of the weight vector α and the vector v, the dimension of B becomes 768 × 3. For example, b1, the first column of B, is the shift for the first token's representation.
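This computation can be sketched in a few lines of PyTorch. The module and variable names below are ours, not the authors' released code, and for simplicity Lα here takes the same hidden states that v shifts (in the paper, its input is the output of the first feed-forward layer):

```python
import torch
import torch.nn as nn

class AdapterBias(nn.Module):
    """Sketch of AdapterBias: a task-specific shift v, weighted per token
    by a scalar alpha produced by the linear layer L_alpha."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_dim))  # task-specific shift
        self.L_alpha = nn.Linear(hidden_dim, 1)         # token-dependent weight

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim)
        alpha = self.L_alpha(hidden_states)             # (batch, seq_len, 1)
        bias = alpha * self.v                           # outer product v (x) alpha
        return hidden_states + bias                     # shifted representations
```

With BERT-base (hidden_dim = 768) and m = 3 tokens, the per-token biases stacked column-wise form exactly the 768 × 3 matrix B of the example above.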

Further improvement on parameter-efficiency of AdapterBias
In this section, we experiment with two different methods to make AdapterBias more parameter-efficient. One is partial weight-sharing of AdapterBias among transformer layers; the other is enforcing the weights of the linear layer (Lα) to be sparse by utilizing an L0-norm penalty.

Cross-layer parameters sharing in AdapterBias
Redundancies have been observed in the information captured by adapters, with adapters in lower layers being less important (Houlsby et al., 2019). In addition, sharing the parameters of the Adapter across layers leads to a comparatively small drop in performance on some tasks. In light of this, we further reduce the number of parameters required for each task by partially sharing the weights of the adapters across all transformer layers. The experimental results are discussed in Section 4.6.1.

L 0 regularization in AdapterBias
Sparsity has been utilized in various parameter-efficient methods. For applications in NLP tasks, Diff-pruning (Guo et al., 2020) learns a sparse vector added to the whole PLM with an L0-norm penalty. Inspired by their work, we further apply L0-norm regularization to Lα in the AdapterBias module, aiming to encourage the sparsity of Lα. We choose to sparsify Lα because it contributes most of the parameters in AdapterBias, so encouraging its sparsity can further increase parameter efficiency. Note that L0 regularization is applied only in the experiments of Section 4.6.2. In AdapterBias, we add an L0-norm penalty to the linear layer (Lα). The optimization problem can be expressed as

min_{θ′} L(D; θ, θ′) + λ ‖θ′_{Lα}‖_0,

where L(D; ·) represents the original loss with training data D, and λ is the hyperparameter for the L0-norm penalty. Note that θ′ represents the trainable parameters and θ′_{Lα} represents the parameters of Lα in AdapterBias. Following the work of Diff-pruning, we utilize a relaxed mask vector (Louizos et al., 2017) with a stretched Hard-Concrete distribution (Jang et al., 2016; Maddison et al., 2016) to encourage L0 sparsity.
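A sketch of a stretched Hard-Concrete gate (Louizos et al., 2017) that relaxes the L0 penalty is given below. The class and the hyperparameter values (β, γ, ζ) are our illustrative choices following that work, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable gate in [0, 1] whose expected L0 norm can be penalized."""
    def __init__(self, size: int, beta: float = 2 / 3,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(size))  # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        # Sample a concrete variable, stretch it to (gamma, zeta), then clip.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid(
            (torch.log(u) - torch.log(1 - u) + self.log_alpha) / self.beta)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def l0_penalty(self) -> torch.Tensor:
        # Expected number of non-zero gates: differentiable L0 surrogate.
        shift = self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        return torch.sigmoid(self.log_alpha - shift).sum()
```

During training, the sampled gates multiply the weights of Lα element-wise, and λ · l0_penalty() is added to the task loss.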

Experiments
In this section, we evaluate the effectiveness of our proposed adapter module on NLP training tasks and provide an analysis of what AdapterBias has learned in different tasks.

Experimental settings
We base our experiments on the HuggingFace PyTorch implementation (Wolf et al., 2019) of the BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019c) models. The learning rate is set in the range [10^-4, 10^-3], with AdamW (Loshchilov and Hutter, 2017) as the optimizer. The GLUE benchmark (Wang et al., 2018) and SQuAD v1.0 (Rajpurkar et al., 2016) are the training data in our settings. The training details are shown in Appendix A.1. Note that the second layer normalization in each transformer layer is also tuned during the training stage, corresponding to the orange component in the right part of Figure 2. We experiment with 3 random seeds and choose the seed with the best performance on the validation set to evaluate on the GLUE server. We report the test metrics provided on the submission website.
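The parameter-freezing described above can be sketched as follows. ToyLayer is a hypothetical stand-in whose parameter names mimic HuggingFace BERT's (`output.LayerNorm` matches the second layer norm of each transformer layer), so the same name filter would carry over to a real model:

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Toy stand-in for one transformer layer with BERT-like parameter names."""
    def __init__(self, dim: int = 8):
        super().__init__()
        self.attention = nn.Linear(dim, dim)
        self.output = nn.ModuleDict({"LayerNorm": nn.LayerNorm(dim)})

model = nn.Sequential(ToyLayer(), ToyLayer())

# Freeze everything except the second layer norm of each layer; in the real
# setup, the AdapterBias parameters would also stay trainable.
for name, p in model.named_parameters():
    p.requires_grad = "output.LayerNorm" in name

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-4)  # lr within [1e-4, 1e-3]
```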

Results on GLUE
In this section, we compare AdapterBias to other parameter-efficient methods, including Adapters (Houlsby et al., 2019), Diff-pruning (Guo et al., 2020), BitFit (Ben Zaken et al., 2021), and LoRA (Hu et al., 2021). Although Diff-pruning achieves the best average score among all parameter-efficient methods, it trains an additional vector whose parameter count is equivalent to that of the whole PLM. Thus, Diff-pruning requires 340M trainable parameters with BERT-large during the training stage, while AdapterBias only trains 0.17M parameters. Furthermore, AdapterBias achieves comparable performance with BitFit and LoRA with fewer parameters needed per task. This shows that AdapterBias is a worthwhile targeted fine-tuning method.

Different base models
To analyze the generalization ability of our approach to different PLMs, we apply AdapterBias to different transformer-based PLMs, including BERT-base (BB), BERT-large (BL), RoBERTa-base (RoB), and RoBERTa-large (RoL), on the GLUE benchmark, as shown in Table 2. All results are scored by the GLUE evaluation server. As Table 2 shows, not only can AdapterBias perform well on BERT, but it also achieves performance competitive with BitFit on larger PLMs such as RoBERTa.

Size of training data
In the previous experimental results, we observe that AdapterBias tends to have higher performance on tasks with a smaller amount of data (i.e., CoLA, SST-2, and RTE). This observation is consistent with Figure 4, which shows a tendency for AdapterBias to outperform full fine-tuning when the training dataset is smaller. However, with more training data available, the trend is reversed. The results show that AdapterBias has the ability to outperform fine-tuning the whole PLM with small-to-medium data sizes, similarly to BitFit.

Investigation on the effectiveness of token dependent representation shift
Different from BitFit (Ben Zaken et al., 2021), where the bias terms in all transformer layers are tuned, we claim that the bias added to the representation should be token-dependent, and we propose AdapterBias based on this concept. We conduct ablation studies to verify this claim. In this experiment, the linear layer (Lα) in AdapterBias that produces the token-dependent weight vector (α) is removed; that is, only the vector (v) is trained. All shifts added to the representation outputs are thus identical within the same transformer layer. The experiments are conducted with the BERT-base model. We report the test scores on the GLUE benchmark in Table 3. The performance of AdapterBias without the linear layer (Lα) decreases dramatically. Without Lα, it is hard for the vector (v) to adapt to different downstream tasks. This result demonstrates the importance of Lα: assigning different shifts to different token representations improves the performance of the method.

Improving the parameter efficiency of AdapterBias
We further apply two additional methods to AdapterBias to enhance its parameter efficiency. Experiments are conducted to examine whether AdapterBias can be made more parameter-efficient by sharing its components across all layers. Moreover, we experiment with adding L0-norm regularization during the training stage to encourage the sparsity of AdapterBias.

Sharing components in AdapterBias
In this experiment, we conduct an ablation study of partial weight-sharing in the AdapterBias module. In Table 4, we share components of AdapterBias among different transformer layers. Share v represents sharing v across all transformer layers, while Share Lα means sharing the linear layer (Lα). Share v+Lα denotes sharing one AdapterBias across all transformer layers. As can be seen in Table 4, the performance of Share Lα stands out among the partial weight-sharing methods, while Share v leads to poor performance. From the experiments above, we conclude that the linear layer (Lα) captures general task information by learning the weights of the bias for different tokens. Thus, sharing Lα across all layers results in better performance than sharing other components. The vector module (v) in AdapterBias aims to learn local information in each transformer layer. If v is shared among different transformer layers, the performance drops dramatically. This might be due to a failure of v to learn general information that can be adapted to each individual transformer layer.

Table 5: Performance of AdapterBias with L0-norm regularization. Here we experiment with two models: BERT-base (BB) and BERT-large (BL). The settings are the same as in Table 1. Full-FT represents fine-tuning the whole PLM without adding adapters.
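The sharing schemes can be compared with a back-of-the-envelope parameter count, assuming BERT-base sizes and that Lα maps the 3072-dimensional first feed-forward output to a scalar (an illustrative accounting; the exact counts reported in the paper may differ):

```python
# BERT-base sizes: hidden width, intermediate (first feed-forward) width, layers
HIDDEN, INTER, LAYERS = 768, 3072, 12

l_alpha = INTER + 1  # weight + bias of the scalar head L_alpha

no_sharing    = LAYERS * (HIDDEN + l_alpha)  # per-layer v and L_alpha
share_v       = HIDDEN + LAYERS * l_alpha    # one v, per-layer L_alpha
share_l_alpha = LAYERS * HIDDEN + l_alpha    # per-layer v, one L_alpha
share_both    = HIDDEN + l_alpha             # one AdapterBias for all layers

print(no_sharing, share_v, share_l_alpha, share_both)
```

Sharing Lα removes most of the budget because the 3072-dimensional head dominates the per-layer cost, which is consistent with Lα contributing most of AdapterBias's parameters.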

L 0 -norm regularization in AdapterBias
We observed that many of the trained parameters in Lα have values extremely close to zero after tuning on downstream tasks, which might indicate redundancy in the parameters. To further encourage the sparsity of AdapterBias, we add L0-norm regularization to Lα during the training stage.
In Table 5, we use BERT-base (BB) and BERT-large (BL) as the PLMs. We compare the performance of fine-tuning, the original AdapterBias, and the one trained with L0-norm regularization. The experiment shows that adding L0-norm regularization during training improves the performance on 7 out of 9 tasks with BERT-base models. However, the performance did not improve when applied to BERT-large models. As for the parameter efficiency of applying the L0-norm penalty, the linear layer (Lα) with the L0-norm penalty saves about 17% of parameters on average compared to the original AdapterBias. The details of the reduced parameters for each task are shown in Appendix A.2.

What AdapterBias learns
AdapterBias has good interpretability due to its simplicity. Compared to the similar work BitFit (Ben Zaken et al., 2021), where the shifts are identical for all tokens, AdapterBias adds token-dependent shifts to the output representation. By observing these token-dependent shifts, we analyze what AdapterBias learns when adapting to downstream tasks.

Average representation shifting in transformer layers
In light of the works of Liu et al. (2019a), Tenney et al. (2019), and Kovaleva et al. (2019), which show that different information is encoded by different transformer layers of PLMs, we assume that AdapterBias provides different representation shifts to the transformer layers through task-specific fine-tuning. In AdapterBias, the linear layer (Lα) produces a weight vector α for the representation shifts; therefore, the average absolute value of α gives us a view of the amount of shifting in the transformer layers when adapting to downstream tasks. In Figure 5, the layers are ordered from lower to upper. From the experimental results, we find that the weight in each layer generally differs considerably across tasks.
CoLA (Warstadt et al., 2019) is a syntactic task that consists of English acceptability judgments in the GLUE benchmark. As shown in Figure 5, its average shift at the ninth layer is the highest among all layers, which is quite different from the other tasks. We speculate that the ninth layer has the ability to extract syntactic information, leading AdapterBias to add the largest shift in this layer. Our experiment echoes the observation of Jawahar et al. (2019), who find on a syntactic task with BShift (Conneau et al., 2018) that the ninth layer of BERT embeds a rich hierarchy of syntactic information. Moreover, we observe similar distributions between specific tasks. For instance, RTE (Giampiccolo et al., 2007; Bentivogli et al., 2009) and MNLI (Williams et al., 2017), which both recognize textual entailment, have higher values in the upper layers than in the lower ones.
Based on these observations, we conclude that AdapterBias assigns suitable representation shifts for different tasks. For tasks with similar objectives, AdapterBias tends to add similar representation shifts.

Which kinds of words does Lα focus on
Since αi represents the weight of the representation shift for the i-th token in a transformer layer, we can observe the significance of the i-th token from the summation of αi over all transformer layers. Special tokens, including [CLS], [SEP], and [PAD], are not included in the analysis. We use the validation sets of CoLA and SST-2, and word clouds are used for visualization.

Figure 7: Word cloud of SST-2, a corpus of movie reviews categorized into two sentiment classes (i.e., positive and negative). The visualization approach is the same as in Figure 6.
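The token-importance score can be sketched as below; the alpha values and tokens are hypothetical stand-ins for the per-layer outputs of Lα:

```python
import torch

torch.manual_seed(0)
num_layers, seq_len = 12, 6
alphas = torch.randn(num_layers, seq_len)  # alpha_i per layer (stand-in values)
tokens = ["[CLS]", "the", "movie", "was", "awful", "[SEP]"]
special = {"[CLS]", "[SEP]", "[PAD]"}

# Sum |alpha_i| over all layers, then drop special tokens from the ranking.
importance = alphas.abs().sum(dim=0)
scores = {tok: importance[i].item()
          for i, tok in enumerate(tokens) if tok not in special}
top_word = max(scores, key=scores.get)
```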
In Figure 6, we visualize all words in the validation data of CoLA. The result shows that AdapterBias focuses more on reflexive pronouns, such as yourself, himself, and myself. This is because there are many incorrect sentences with misused reflexive pronouns, such as "He washed yourself." In Figure 7, we visualize all words in the validation data of SST-2. The result shows that AdapterBias focuses more on adjectives, such as "bad", "awful", and "worst". SST-2 is a binary sentiment analysis dataset, which classifies movie reviews into positive and negative classes. During tuning, AdapterBias learns that adjectives often constitute a crucial factor in sentiment analysis and adds larger shifts to these adjective tokens.

Conclusion
In this study, we present AdapterBias. By adding token-dependent representation shifts to the PLM, AdapterBias shows competitive results even though it uses far fewer parameters than existing methods. Through extensive experiments, not only does AdapterBias reach competitive results on the GLUE benchmark, but it also obtains good performance on small-to-medium datasets. In addition, we demonstrate the robustness of AdapterBias across different PLMs. Finally, we provide an analysis of what AdapterBias learns by comparing α, the weights of the representation shifts for different tokens, finding that AdapterBias has the ability to identify task-specific information. Our study differs from previous adapter architectures by proposing a simple adapter that produces suitable representation shifts for different tokens.

A.1 Training Details
We train our models with PyTorch. The training details are shown in Table A. In addition, the bottleneck dimension of Adapters (Houlsby et al., 2019) is set to 32.
A.2 L0-norm regularization in AdapterBias

In Table B, we report the remaining parameters when utilizing L0-norm regularization compared with the original AdapterBias. BERT-base (BB) and BERT-large (BL) are used as PLMs.

A.3 The direction of representation shifts in different tasks
Different from BitFit (Ben Zaken et al., 2021), where all the representation shifts are identical within one task, AdapterBias produces different weights for the shift based on each token. In this section, we compare the transformed tokens in AdapterBias and BitFit. We utilize PCA (Jolliffe, 2002) to reduce the dimension of the vectors. In Figure A, we input five sentences from the evaluation set of SST-2. We experiment on the last transformer layer since it has the most obvious shifts compared to the previous layers. '0' in a lighter color indicates the representation before shifting, which is the output of the first layer normalization. '1' in a darker color is the shifted representation, which is the output of the second layer normalization. The color red represents positive sentences, and blue represents the negative ones. The result shows that BitFit shifts all tokens in the same direction regardless of the ground-truth label. On the other hand, AdapterBias discerns the label of the sentences and thus shifts the tokens of different sentences in different directions.
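A numpy-only sketch of the projection used here (PCA via SVD); the representations are random stand-ins for the layer-norm outputs before and after shifting:

```python
import numpy as np

def pca_2d(X: np.ndarray) -> np.ndarray:
    """Project rows of X onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates in the PC plane

rng = np.random.default_rng(0)
before = rng.normal(size=(50, 768))                  # first layer-norm outputs
after = before + 0.1 * rng.normal(size=(50, 768))    # shifted representations

# Fit one projection on both sets so their positions are directly comparable.
proj = pca_2d(np.vstack([before, after]))
proj_before, proj_after = proj[:50], proj[50:]
```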