Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters

Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters. Previously proposed adapter architectures are all feed-forward neural networks. In this paper, we investigate the effectiveness of using tiny-attention (i.e., attention with extremely small per-head dimensionality) as adapters. Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, a capability that previously proposed adapters lack. Moreover, we view its multiple attention heads as a mixture of experts and propose to average their weights during deployment, which further reduces its inference computation cost. On the GLUE benchmark, our tiny-attention adapter outperforms the other parameter-efficient transfer learning methods as well as full fine-tuning while only updating 0.05% of the parameters. On the FewGLUE benchmark, its performance is comparable to that of GPT-3 and PET.


Introduction
Transferring a large pretrained language model (PLM) is the de facto paradigm for performing downstream tasks in natural language processing (NLP). A general approach is adapter-tuning: inserting adapters (i.e., neural networks with small numbers of parameters) into each pretrained layer and only updating the adapter parameters. Adapter-tuning is parameter-efficient and enjoys low computation cost since it keeps the PLM frozen, but it typically underperforms full fine-tuning, which updates all the parameters of the PLM.
In this paper, we propose a new adapter architecture that outperforms full fine-tuning yet uses even fewer parameters than the previously proposed adapters as well as other parameter-efficient transfer learning methods; see Figure 1.

Figure 1: Average performance of different parameter-efficient transfer learning methods on the GLUE benchmark with roberta-large as the PLM (e.g., AdaMix 89.9, full fine-tuning 88.9, Adapter_H 87.8). Our method is Tiny-Attn: "1H" and "4H" mean "one attention head" and "four attention heads (with the parameter-averaging trick in section 2.2)" respectively.
Our adapter is a multi-head attention module whose per-head dimensionality is extremely small; thus we call it tiny-attention. The architecture design is inspired by the following intuitions: 1. For each input sequence, each layer of the language model produces an embedding for each token in the sequence; see Figure 2a. 2. All parameter-efficient transfer learning methods learn to modify the embeddings in directions that perform the given task well; see He et al. (2021) for a thorough theoretical and empirical discussion. 3. Almost all previously proposed adapter architectures are feed-forward neural networks. Thus, we suspect that their embedding modifications are not as contextually rich as they should be.
Therefore, we propose to use an attentive structure that allows the embedding modifications of each token to capture more contextual information by directly looking at the embeddings of all the tokens. The dimensionality of each attention head need not be large: in NLP tasks, contextual information is often demonstrated to be more important than model size; for example, in language modeling, a smaller model with a larger context window usually outperforms a larger model with a smaller context window (Dai et al., 2019; Wu et al., 2022; Yang et al., 2019). Additionally, we view the multiple attention heads of our tiny-attention adapter as a mixture of experts and propose to average their weights during inference. This technique further reduces the inference cost.

Figure 2: The pipeline of applying our tiny-attention adapter to a pretrained language model for downstream tasks. Figure 2c shows the internal architecture of our tiny-attention adapter.
We evaluated our tiny-attention adapter on the GLUE (Wang et al., 2018) and FewGLUE (Schick and Schütze, 2021; Wang et al., 2019) benchmarks. On GLUE, it updates only 0.05% of the parameters (an order of magnitude fewer than all the other methods) yet still outperforms full fine-tuning and nearly all the other parameter-efficient tuning methods. On FewGLUE, it is comparable to several strong competing methods, including PET (Schick and Schütze, 2021) and GPT-3 (Brown et al., 2020). We also conducted ablation studies to investigate its effectiveness with varying placements and PLMs.

The Method
Figure 2a illustrates how a language model performs a downstream task. The language model has L layers. Given an input sequence x = x_0 x_1 ... x_T where x_0 is a special classification (CLS) token, each layer ℓ reads the embeddings given by the previous layer ℓ - 1 and produces the layer-ℓ embeddings h_0^(ℓ), ..., h_T^(ℓ). Then a task-specific decoder reads the top-layer embedding h_0^(L) of the CLS token and predicts a task-specific output ŷ for that sequence.
Transferring the language model involves updating its trainable parameters to minimize a task-specific loss(ŷ, y), where y is the ground-truth label for the given x. For adapter-tuning, the trainable parameters only include those of the task-specific decoder and those of the adapters. Figure 2b shows a language model layer with our tiny-attention adapter placed between its attention module and feed-forward net: during training, only the decoder and adapter parameters (blue) are updated while the pretrained parameters (green) are all kept frozen.
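The freezing scheme above can be sketched in a few lines of PyTorch. This is a minimal illustration with hypothetical stand-in modules, not the paper's actual implementation:

```python
import torch
from torch import nn

# Hypothetical stand-ins: a "pretrained" layer, a placeholder adapter,
# and a task-specific decoder head (names are illustrative only).
plm_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4)
adapter = nn.Linear(64, 64)   # placeholder for a tiny-attention adapter
decoder = nn.Linear(64, 2)    # task-specific decoder

# Freeze every pretrained parameter; only adapter + decoder stay trainable.
for p in plm_layer.parameters():
    p.requires_grad = False

trainable = list(adapter.parameters()) + list(decoder.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # optimizer never sees frozen weights
```

Because the optimizer only tracks the adapter and decoder parameters, its state (e.g., Adam moments) is correspondingly tiny.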

Tiny-Attention Adapter
Our adapter has an attentive structure: as shown in Figure 2b, at each position t, it takes as input the intermediate embeddings z^(ℓ) from not only the current position (information flow indicated by blue arrows) but also all the other positions (information flow shown by red arrows). For each token t, it produces a task-specific modification Δz_t^(ℓ). Then the modified embeddings z_t^(ℓ) + Δz_t^(ℓ) are fed into the layer-ℓ feed-forward net for producing h_t^(ℓ). As shown in Figure 2c, the internal architecture of our tiny-attention adapter resembles an ordinary multi-head attention mechanism. Suppose it has M attention heads. Each attention head m produces a head-specific attention vector Δz_t^(ℓ,m), and the final modification Δz_t^(ℓ) is obtained by projecting the concatenation of all the Δz_t^(ℓ,m). The mathematical details of each head Attn^(m) are given in Appendix A.

Why attention as adapter? Attention allows the task-specific modification Δz_t^(ℓ) for each token t to aggregate useful information from its full context at t = 0, 1, ..., T. It is analogous to how the attention modules in the pretrained language model learned to construct contextual representations that helped optimize the language modeling objective during pretraining. Therefore, when the pretrained attention modules are frozen, it seems natural to adopt new trainable attention modules for their desired behavior.
The key difference between our tiny-attention and ordinary attention is that our per-head dimension is tiny: i.e., Δz_t^(ℓ,m) ∈ R^D with D very small. Throughout our experiments, we set D = 1.
Why tiny-attention? Smaller dimensionality means fewer trainable parameters and less computation cost; thus it is preferred. We believe that a small dimensionality is sufficient in our setting because of two key observations. First, a smaller model often tends to work as well as a larger model if it is allowed to use a larger context; see section 1 for the example of language modeling (Dai et al., 2019; Wu et al., 2022; Yang et al., 2019). Second, Hu et al. (2021) found that feed-forward adapters can achieve competitive results with extremely low-rank (1 or 2) parameter matrices. Similar to the LoRA method by Hu et al. (2021), our tiny-attention adapter essentially performs a low-rank non-linear projection: it first linearly transforms the high-dimensional embeddings z^(ℓ) to low-dimensional query, key, and value vectors; it then linearly transforms the low-dimensional attention vectors, after the non-linear attention operation, to the high-dimensional modification vectors Δz^(ℓ); see Appendix A for technical details.
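To make the low-rank non-linear projection concrete, here is a minimal single-head sketch in plain Python with D = 1. The weight names (wq, wk, wv, wo) are illustrative assumptions; the actual adapter is a multi-head module whose exact form is given in Appendix A:

```python
import math

def tiny_attention(Z, wq, wk, wv, wo):
    """One tiny-attention head with per-head dimension D = 1.

    Z          : list of T hidden vectors (each a list of d floats)
    wq, wk, wv : length-d weight vectors projecting each hidden state
                 down to a *scalar* query / key / value
    wo         : length-d output weights lifting the scalar attention
                 result back up to the hidden dimension
    Returns one task-specific modification vector per position.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    q = [dot(wq, z) for z in Z]   # scalar queries
    k = [dot(wk, z) for z in Z]   # scalar keys
    v = [dot(wv, z) for z in Z]   # scalar values
    mods = []
    for t in range(len(Z)):
        # Each position attends over ALL positions (the full context).
        scores = [q[t] * k[s] for s in range(len(Z))]  # D = 1, so no 1/sqrt(D) scaling needed
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]    # numerically stable softmax
        total = sum(weights)
        a_t = sum(w / total * vs for w, vs in zip(weights, v))  # scalar attention output
        mods.append([a_t * o for o in wo])             # lift back to dimension d
    return mods
```

The down-projections and the output projection make this a rank-1 bottleneck around the non-linear softmax, mirroring the low-rank structure of LoRA.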

Multiple Heads as a Mixture of Experts
The multiple attention heads in our tiny-attention can be regarded as a mixture of experts, where each head is an expert that specializes in capturing certain kinds of contextual information (e.g., syntax, semantics). The output projection in equation (1a) learns to aggregate the information Δz_t^(ℓ,m) produced by the experts. Rearranging that equation gives Δz_t^(ℓ) = Σ_m O^(ℓ,m) Δz_t^(ℓ,m), where the per-head matrices O^(ℓ,m) are defined by splitting the output projection matrix across the heads. This inspires us to propose a parameter-averaging trick that further reduces the storage and computation cost of our method. Precisely, after training, we average the output projection matrices O^(ℓ,m) as well as the attention parameters inside Attn^(m) across the attention heads, and we only store the averaged parameters. During inference, we use a single attention head into which the stored parameters have been loaded. That way, although we may have trained M > 1 attention heads, our storage and inference cost is as low as if we had only trained a single head. The technical details of this trick are discussed in Appendix A.
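The averaging step itself is simple. A minimal sketch, assuming each head's parameters are kept as named lists of floats (a hypothetical layout, not the paper's storage format):

```python
def average_heads(head_params):
    """Average per-head parameters into a single head for deployment.

    head_params: list of M dicts, each mapping a parameter name
                 (e.g. 'wq', 'wk', 'wv', 'wo') to a list of floats.
    Returns one dict with each parameter averaged elementwise over heads.
    """
    M = len(head_params)
    return {
        name: [sum(h[name][i] for h in head_params) / M
               for i in range(len(head_params[0][name]))]
        for name in head_params[0]
    }
```

After this, inference loads only the averaged dict into a single head, so storage and compute match the single-head model regardless of how many heads were trained.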

Related Work
There are three major paradigms of parameter-efficient transfer learning. The first is to only fine-tune a small subset of the existing parameters in a pretrained language model (Howard and Ruder, 2018; Lee et al., 2019; Zaken et al., 2021). The second is adapter-tuning (Houlsby et al., 2019): inserting adapters (i.e., small neural nets) into the language model and only tuning their parameters. The third is prefix-tuning (Li and Liang, 2021; Hambardzumyan et al., 2021; Liu et al., 2021b,a; Lester et al., 2021): augmenting the input sequence with trainable tokens and only updating the new token embeddings. Both adapter-tuning and prefix-tuning keep the PLM frozen. Our work falls into the category of adapter-tuning. The key difference is that our proposed adapter has an attention architecture, whereas the previously proposed methods in this direction all use feed-forward neural networks, e.g., Houlsby et al. (2019) and Pfeiffer et al. (2021). AdaMix (Wang et al., 2022) proposes a stochastic routing strategy to mix an ensemble of adapters and is orthogonal to all the adapter methods, including ours. Akin to our parameter-averaging trick, they propose to average the adapter parameters for low-cost storage and inference. Similar tricks are used in Ravi (2017), Matena and Raffel (2021), and Wortsman et al. (2022).

Experiments
We evaluated our proposed tiny-attention adapter on a range of natural language understanding tasks from GLUE and FewGLUE. Our method is implemented in PyTorch (Paszke et al., 2019) and relies heavily on HuggingFace (Wolf et al., 2020). Our code is submitted for review and will be publicly released after the paper is published.
In all of our experiments, we set the dimensionality of each tiny-attention head (i.e., the dimension of the query, key, and value vectors) to one. Other experimental details (e.g., hyperparameters) can be found in Appendix B.

Main Results: GLUE and FewGLUE
GLUE.On the GLUE benchmark, we chose the RoBERTa model (Liu et al., 2019) as our PLM and we used the pretrained roberta-large weights (355M parameters) downloaded from HuggingFace.
Our results on the GLUE benchmark are presented in Figure 1. As we can see, our method (Tiny-Attn-1H and Tiny-Attn-4H) outperforms all the previously proposed parameter-efficient tuning methods as well as fine-tuning, yet it uses significantly fewer trainable parameters than the other methods except WARP. The single-head version (Tiny-Attn-1H) trains 176K parameters, which amounts to only 0.05% of the PLM parameters. The four-head version (Tiny-Attn-4H) further improves performance with an increased training cost, but its storage and inference cost remains the same as the single-head version.
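A back-of-the-envelope count shows why the adapter is so small. The exact accounting below (biases, no layer norms, no decoder head) is our assumption, so it recovers only the order of magnitude of the reported 176K figure:

```python
# roberta-large: hidden size 1024, 24 layers; our adapter: D = 1, M = 1 head.
d, num_layers, D, M = 1024, 24, 1, 1

qkv = 3 * (d * D + D)        # query/key/value down-projections (with biases)
out = M * D * d + d          # output projection back to the hidden size (with bias)
total = num_layers * (qkv + out)

print(total)                  # ~1e5 adapter parameters (same order as the reported 176K)
print(total / 355e6)          # well under 0.1% of the PLM's 355M parameters
```

Whatever the precise bookkeeping, the per-layer cost scales with d * D rather than d^2, which is where the parameter savings come from.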
Our method only underperforms AdaMix, which learns a stochastic routing strategy to mix an ensemble of adapters. But AdaMix uses significantly more trainable parameters than ours. Moreover, AdaMix's technique is orthogonal and complementary to most other adapter methods, including ours.
Our results on each individual GLUE task can be found in Tables 8 and 9 of Appendix C.1.

FewGLUE.
We also evaluated our method on the CB and RTE tasks of the FewGLUE benchmark. These are extremely few-shot settings: each task has only 32 training examples. We chose ALBERT (Lan et al., 2019) as our PLM and used the pretrained albert-xxlarge-v2 weights (223M parameters) downloaded from HuggingFace. The detailed setting can be found in Appendix B.1. The results are shown in Table 1. It turns out that the performance of our method is comparable to that of PET (Schick and Schütze, 2021) and GPT-3 (Brown et al., 2020).

Analysis

Sequential vs. parallel. In section 2, we presented the 'sequential' design, where our tiny-attention modules are placed between the pretrained attention and feed-forward net. Another option is to put the tiny-attention module in 'parallel' to the original attention layer as in He et al. (2021), as illustrated in Figure 3. We empirically found that these two design choices have negligible differences in results. Detailed results are listed in Table 10 of Appendix C.2.
Effects of parameter-averaging. Recall that we used the parameter-averaging trick in our experiments for improved inference efficiency. But do we lose any performance by using this trick? Through ablation studies on CoLA and RTE, we found that using the parameter-averaging trick actually slightly improves the results. Detailed results are shown in Table 11 of Appendix C.1.
Does the size of the PLM matter? Parameter-efficient tuning methods are known to suffer a performance drop when working with small PLMs. To investigate this effect, we also experimented with the pretrained roberta-base (125M parameters) downloaded from HuggingFace on the MNLI and SST-2 tasks. The results are shown in Figure 4: different methods suffer almost the same amount of performance drop, but our method enjoys a much larger reduction in the number of trainable parameters.

Effects of larger dimensions. We set D = 1 in all our main experiments. It is natural to ask what happens if we use a larger dimension. We experimented with a D = 4 variant of our method, which we call Tiny-Attn-1H4D, on CoLA and RTE.
We found that this variant performs slightly worse than the standard version Tiny-Attn-1H. It is not unusual for more adapter parameters to lead to slightly worse performance on some GLUE tasks; see Hu et al. (2021). Detailed results are shown in Table 12 of Appendix C.1.

Conclusion
In this paper, we presented the tiny-attention adapter. While previous adapters only modify each position's embedding in place, our method considers the context from the other positions. Thanks to this contextual modeling, our tiny-attention adapter can be extremely small (e.g., only one attention head with a per-head dimension of one). To further enable a trade-off between performance and training cost, we proposed the weight-averaging technique. On the GLUE benchmark, our tiny-attention adapter achieved better results than full fine-tuning while only updating 0.05% of the parameters. Our model also achieved competitive results in the few-shot setting. Lastly, we compared our method with the alternatives (e.g., parallel instead of sequential placement, no parameter averaging) and showed that it generalizes to smaller PLMs.

Limitations
Our main limitation is that our method was only evaluated on the GLUE and FewGLUE benchmarks; we have not yet experimented with a diverse set of generation tasks (e.g., XSUM (Narayan et al., 2018), E2E (Novikova et al., 2017a)). He et al. (2021) reported that the state-of-the-art parameter-efficient adaptation method on one task or dataset may suffer a sharp performance drop on another. Although our method is consistently effective across multiple classification tasks, it is still possible that it will not perform well on generation tasks such as summarization and translation.

Ethics Statement
Our method belongs to the general category of efficient language model fine-tuning and our focus is to reduce the number of trainable parameters.
It can benefit scenarios where communication becomes a bottleneck, such as federated learning, distributed training, and edge computing. However, since we do not apply explicit differential privacy methods to the updated parameters, the method can be vulnerable to specific attacks (e.g., man-in-the-middle attacks).
Our method can also be deployed in on-device applications, where storage is limited. In these applications, each task can keep only a small set of task-specific parameters while the remaining parameters are shared.
Our method can help reduce the carbon footprint of model training in two ways. First, the model is fine-tuned from a pretrained model and can thus be adapted to a downstream task quickly while reaching satisfying performance. Second, our method only updates a small portion of all parameters, so the optimizer only tracks a small part of the parameters. For this reason, the same training infrastructure can support a larger batch size, since the optimizer's state is significantly reduced (e.g., to less than 1% in our Tiny-Attn-1H method).
Meanwhile, our method shares the same risks as most previous efficient training methods, such as misuse, data bias, and susceptibility to adversarial attacks. However, the method developed in this paper is orthogonal to previous efforts to mitigate these issues.
We used a linear warmup with approximately 10% of the total steps as the warmup steps in some tasks. The number of epochs was fixed to 20. We evaluated on the validation set twice per epoch and report the best result. We used the same batch size of 16 for all tasks. For reproducibility, we used a fixed random seed. The learning rate (lr) was selected from [1E-6, 1E-5, 1E-4, 5E-4, 8E-4, 1E-3, 1.5E-3, 2E-3, 3E-3, 5E-3, 7E-3] and the weight decay (dec) was chosen from [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]. Note that we did not do a full hyperparameter search. On average, we ran about 80 experiments per task, although the exact number varies with the size of the dataset to save computational cost; e.g., on MNLI and QQP, we only performed about 20 experiments each. We performed our experiments mainly on 10 Nvidia RTX A4000 GPUs. The average running time is about eight hours, varying across tasks.
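Since the search over this grid was partial, one way to enumerate a budgeted subset is sketched below. This is a hypothetical illustration (run_experiment and the sampling strategy are our assumptions, not the paper's actual tooling):

```python
import itertools
import random

# The candidate values reported above.
LRS  = [1e-6, 1e-5, 1e-4, 5e-4, 8e-4, 1e-3, 1.5e-3, 2e-3, 3e-3, 5e-3, 7e-3]
DECS = [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]

def sample_configs(budget, seed=0):
    """Draw a budgeted random subset of the full (lr, dec) grid,
    mimicking a partial rather than exhaustive search."""
    grid = list(itertools.product(LRS, DECS))  # 11 x 11 = 121 combinations
    rng = random.Random(seed)                  # fixed seed for reproducibility
    return rng.sample(grid, min(budget, len(grid)))
```

For a large task like MNLI one would call `sample_configs(20)`; for smaller tasks a larger budget such as 80 fits the counts reported above.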
For Tiny-Attn-1H, we initialized the output projection matrices to be very small (uniform on (-0.01/√D, 0.01/√D), where D is the per-head dimension) to make our model behave like the original pretrained model at early stages of training. Empirically, we found this trick important in stabilizing training. The hyperparameters of the best-performing single-head models are shown in Table 4.

For Tiny-Attn-4H, we perturbed the weights of Tiny-Attn-1H to initialize every head, due to computational limits. In principle, we only need to initialize the weights of every head to be almost the same. This makes the parameter-averaging trick applicable, since the weights of these heads will not drift far from each other, just as when we average the parameters of multiple fine-tuned models with the same initialization (Neyshabur et al., 2020). The hyperparameters we use are shown in Table 5.

B.1 FewGLUE

FewGLUE (Schick and Schütze, 2021) is a subset of the SuperGLUE benchmark (Wang et al., 2019) with the sizes of all training sets being 32. We evaluated our model on two of its tasks: CB (De Marneffe et al., 2019) and RTE.
We still used AdamW, but we did not use a scheduler. We used a linear warmup with 10% of the total steps as the warmup steps for RTE. The number of epochs was fixed to 20. Following Schick and Schütze (2021), we did not use any additional validation set for parameter selection or early stopping, and we only report the final result. In contrast to them, we did not use any additional unlabeled examples. We used a fixed learning rate of 1E-3 and a fixed batch size of 1. We tried a few different weight decay values [0, 0.02, 0.05, 0.1], but eventually chose the same value 0.02 for all tasks. After selecting the hyperparameters with a fixed random seed, we ran over 4 additional random seeds and report the average performance and the standard deviation.
Following Schick and Schütze (2021), we used albert-xxlarge-v2 (Lan et al., 2019) downloaded from the HuggingFace library as the backbone PLM, and used manual prompts to replace the classification head. The manual prompt we used was "[CLS]<hypothesis>?[MASK].<premise><SEP>", the same as theirs. The language model predicts the token at the masked position and the output space is limited to a few selected tokens. We used "yes" to represent "entailment", "no" to represent "contradiction", and "maybe" to represent "neutral".
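The verbalizer step can be sketched as follows. Here mask_logits is a hypothetical pre-extracted dict of logits at the [MASK] position restricted to the candidate tokens (the real pipeline reads these off the masked-LM head):

```python
def verbalize(mask_logits, verbalizer):
    """Pick the label whose verbalizer token scores highest at [MASK].

    mask_logits : dict mapping candidate tokens to masked-LM logits
    verbalizer  : dict mapping labels to tokens, e.g. the paper's
                  {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}
    """
    return max(verbalizer,
               key=lambda label: mask_logits.get(verbalizer[label], float("-inf")))
```

Restricting the output space to these few tokens turns the masked-LM prediction into a classification decision without any new classification head.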

B.2 Analysis
Sequential vs. parallel. In the main experiments, we used a 'sequential' (seq) structure where our tiny-attention modules are placed between the pretrained attention and feed-forward net. Another option is to put the tiny-attention module in 'parallel' (para) to the original attention layer, as in He et al. (2021).
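The two placements differ only in where the adapter's output is added. A minimal sketch, with callables standing in for the real modules (the names and the list-of-floats hidden state are illustrative assumptions):

```python
def layer_forward(x, attn, ffn, adapter, placement="seq"):
    """Forward pass of one adapted layer under the two placements.

    attn, ffn, adapter : callables standing in for the pretrained attention,
                         the pretrained feed-forward net, and the tiny-attention
                         adapter; x is a hidden state as a list of floats.
    """
    if placement == "seq":
        z = attn(x)
        z = [zi + di for zi, di in zip(z, adapter(z))]   # adapter modifies the intermediate state
    else:  # "para": adapter branch runs alongside the pretrained attention
        z = [ai + di for ai, di in zip(attn(x), adapter(x))]
    return ffn(z)
```

In 'seq' the adapter sees the attention output; in 'para' it sees the layer input, and the two branches are summed before the feed-forward net.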
We used the same setting as our main experiments on GLUE except that we used roberta-base as the backbone PLM. Warmup was used for all experiments. The hyperparameters we used are shown in Table 6. Note that there are other possible placements; e.g., we could place tiny-attention adapters above the feed-forward networks. We did not do this because our philosophy is to augment the pretrained attention but not to interfere with anything else.
Effects of parameter-averaging. We used the same setting as our Tiny-Attn-4H on GLUE except that we did not use parameter-averaging during inference; instead, we used all 4 heads as in training. The hyperparameters we used are shown in Table 7.

Does the size of the PLM matter? The hyperparameters we used are shown in Table 6.

C Results and Analysis Details
We use "Adapter_H" and "Adapter_P" to denote the adapters proposed in Houlsby et al. (2019) and Pfeiffer et al. (2021), respectively. "Tiny-Attn-kH" represents our method with k attention heads; e.g., "Tiny-Attn-1H" is our method with a single attention head.

C.1 GLUE and FewGLUE
Results in this section are discussed in section 4.1.
GLUE. Table 8 shows our results on the GLUE development set with roberta-large as the PLM. Note that WARP used a slightly different set of validation metrics from the other methods, but based on our results we can assume that this difference does not significantly change the average score, which is why we directly used its score in Figure 1.

C.2 Analysis
The results in this section are discussed in section 4.2.
Sequential vs. parallel. Table 10 shows our results on SST-2 and MNLI with different structures. We found that these two design choices have negligible differences in results.
Effects of parameter-averaging. We found that using our parameter-averaging trick actually slightly improved the results, as shown in Table 11.

Effects of larger dimensions. We found that Tiny-Attn-1H4D is slightly worse than the standard version Tiny-Attn-1H, as shown in Table 12.

Stability test
We report the best result with a fixed random seed for our main experiments. To test the stability of our method, we ran experiments on SST-2 and MNLI using the same hyperparameters with 5 different random seeds. The results are shown in Table 13. We can see that the results on larger datasets like MNLI are quite stable, while the results on SST-2 have a larger variance.

Generation results
There has been research showing that parameter-efficient transfer methods with good classification performance may not work equally well on non-classification tasks, and vice versa (He et al., 2021). We conducted experiments on the E2E NLG Challenge (Novikova et al., 2017b) and found that our method also suffers from this problem. The results are shown in Table 14.

Figure 2b: Adapting a language model layer by adding our tiny-attention adapter.

Figure 3: Illustration of the parallel structure.

Figure 4: Performance of different parameter-efficient tuning methods on SST-2 and MNLI with roberta-base. The baseline results are from He et al. (2021).

Table 1: Results on the validation set of FewGLUE. We report accuracy for all tasks.

Table 5: Hyperparameters for Tiny-Attn-4H. Since we initialize from the weights of Tiny-Attn-1H, tasks on which we do not get an improvement are omitted.

Table 6: Hyperparameters for experiments with roberta-base as the PLM.

Table 8: Results on the GLUE development set. All runs use roberta-large as the backbone. We report Matthews correlation for CoLA, the overall (matched and mismatched) accuracy and matched accuracy for MNLI, Pearson correlation for STS-B, accuracy and F1 score for QQP and MRPC, and accuracy for the other tasks. Higher is better for all metrics. All runs follow the setting in Houlsby et al. (2019) except fine-tuning. The results of WARP and AdaMix are taken from their own papers; the other results are taken from Hu et al. (2021).

Table 9: Results on the GLUE test tasks. We use the same setting as on the dev set, different from the fine-tuning baseline and WARP. The results other than ours are published in Hambardzumyan et al. (2021). We do not show the result on WNLI, but it is included in the final score.

Table 10: Performance of Tiny-Attn-1H with different structures on SST-2 and MNLI. We report accuracy for SST-2 and matched accuracy for MNLI.

Table 11: Results of Tiny-Attn-4H with different parameter-averaging settings on CoLA and RTE. We report Matthews correlation for CoLA and accuracy for RTE.

Table 12: Results of Tiny-Attn-1H4D compared with Tiny-Attn-1H on CoLA and RTE. We report Matthews correlation for CoLA and accuracy for RTE.

Table 13: Stability test. We report accuracy for SST-2 and matched accuracy for MNLI. Seed-1 is the seed we used in the main experiments.
Our result is comparable to the adapters but worse than prefix-tuning.

Table 14: Results on the E2E NLG Challenge benchmark with gpt2-medium as the PLM. We report the BLEU score computed by the official evaluation script.