Empowering Parameter-Efficient Transfer Learning by Recognizing the Kernel Structure in Attention

The massive number of trainable parameters in pre-trained language models (PLMs) makes them hard to deploy to multiple downstream tasks. To address this issue, parameter-efficient transfer learning methods have been proposed to tune only a few parameters during fine-tuning while freezing the rest. This paper looks at existing methods along this line through the kernel lens. Motivated by the connection between self-attention in transformer-based PLMs and kernel learning, we propose kernel-wise adapters, namely Kernel-mix, that utilize the kernel structure in self-attention to guide the assignment of the tunable parameters. These adapters use guidelines found in classical kernel learning and enable separate parameter tuning for each attention head. Our empirical results, over a diverse set of natural language generation and understanding tasks, show that our proposed adapters can attain or improve the strong performance of existing baselines.


Introduction
Transfer learning using large-scale transformer-based pre-trained language models (PLMs) (Radford et al., 2019) has become the standard scheme for various natural language processing (NLP) tasks. Among many strategies, fine-tuning these PLMs has emerged as the predominant way to adapt the generic models to a specific task (Howard and Ruder, 2018). However, deploying these models is a challenge, as curating customized models across a wide variety of tasks leads to scalability issues: one must store (and sometimes move) multiple copies of the PLM parameters for different tasks, which is inefficient.
A popular approach to tackling such scalability issues is to make PLM-based transfer learning more parameter-efficient. This can be done by freezing most of the PLM parameters and inserting small trainable modules into the PLM. Adapters (Houlsby et al., 2019; Pfeiffer et al., 2020; Hu et al., 2021) and Prefix-/Prompt-tuning (Shin et al., 2020; Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021c,b) have emerged as the prominent approaches under this paradigm. These methods are incredibly parameter-efficient and attain performance comparable to full fine-tuning on many common NLP tasks (mainly in Natural Language Understanding) while tuning only 0.1-3% of the original PLMs' parameters.
However, most of these studies take the PLMs as a black box, i.e., these methods are not customized to transformers. This raises the question of whether parameter-efficient fine-tuning has fully utilized the transformer structure in PLMs. Therefore, in this work, we propose kernel-wise adaptation, which recognizes and utilizes the kernel structure within self-attention, the core component of a transformer. Specifically, we take inspiration from recent work that connects self-attention to kernel learning (Choromanski et al., 2020; Chen et al., 2021; Tsai et al., 2019) and treat the different attention heads in a transformer's attention sub-layer as separate kernel estimators. We hypothesize that parameter-efficient tuning can benefit from useful guidelines in the classical kernel learning literature and incorporate them into our proposed methods. These include: 1. By interpreting attention heads as kernel estimators, we design the adaptation to be head-specific; 2. We assign a larger budget of tunable parameters to learning the value components in the attention mechanism, which correspond to the coefficients in kernel methods.
We discuss these guidelines in detail in § 4.1 and evaluate our hypotheses through rigorous empirical evaluation. First, we test the effectiveness of the two guidelines above by comparing the default LoRA (Hu et al., 2021), a state-of-the-art approach for efficient adaptation, against two of its variants that implement the two guidelines, respectively. Next, we evaluate our variant of kernel-wise adaptation on three Natural Language Generation (NLG) benchmarks and two Natural Language Understanding (NLU) tasks using the GPT-2 architecture. While parameter-efficient work has extensively covered NLU tasks, it is unknown how well the results transfer to NLG tasks. As language generation typically requires more expressive models, we put more emphasis on the NLG tasks, which include data-to-text, free-form question answering (QA), and summarization. The empirical results in § 6 demonstrate that, with the same parameter budgets, our proposed method can attain better generation quality and classification accuracy than previous techniques, and in many settings it is close to or even outperforms full-parameter fine-tuning.

Related Work
The literature on parameter-efficient adaptation can be broadly categorized as follows. Adapters. Originally proposed by Houlsby et al. (2019) and Pfeiffer et al. (2020), adapters modulate the output of a transformer layer by inserting small Multi-Layer Perceptron (MLP) bottleneck layers. Recent work has proposed many variants of the original adapters, including dropping adapters across several layers (Rücklé et al., 2020) or constraining adapters to be low-rank operators.
A recent line of work has focused on identifying an important subset of parameters within the PLMs. Ben Zaken et al. (2021) proposed to tune only the bias terms in the PLMs. MPOP (Liu et al., 2021a) suggested decomposing the weight matrices in PLMs through matrix product operators (MPO) and training only the small matrices obtained from the decomposition (freezing the large ones), which implicitly recognizes the small matrices as the important subset.
Low-rank adaptation (LoRA) (Hu et al., 2021) directly assumes that the update of the weight matrices during training is approximately low-rank, and accordingly proposes to re-parameterize an original weight matrix W as W + BA, where W is frozen to its pre-trained weights whereas A ∈ R^{r×N_h p} and B ∈ R^{N_h p×r} are updated in training.^1 We note that LoRA, too, introduces new weights A and B similar to adapters, but they are used only to re-parameterize the existing weights and do not add extra sandwiched layers that modify the original model architecture.
Prefix-tuning. Originally shown in GPT-3 (Brown et al., 2020), prompts are extra tokens that help in the task adaptation of PLMs. Transitioning from the manual design of prompts, Shin et al. (2020) searched for prompts over the discrete space of tokens based on task-specific training data; Li and Liang (2021), Lester et al. (2021), and Liu et al. (2021c,b) further extended the search space to continuous prompts and tuned the prompts by back-propagating the training error. Prompt-based methods have been shown to be similar to adapters by He et al. (2021).

Preliminaries
We start by providing a brief introduction to the transformer architecture (§ 3.1) and then revisit the connection between attention and kernel estimators (§ 3.2), building on which we propose the kernel-wise adapter in § 4.

Transformer Architecture
Transformers (Vaswani et al., 2017) are composed of L stacked layers, where each layer comprises a multi-headed attention sub-layer and a fully connected feed-forward network (FFN) sub-layer.^2 The attention sub-layer, assuming N_h heads and dimension size p for each head, first maps an input X ∈ R^{n×N_h p} into the query (Q), key (K), and value (V) matrices through the following affine transformations:

Q = XW_q + b_q,  K = XW_k + b_k,  V = XW_v + b_v,  (1)

where W_q, W_k, W_v ∈ R^{N_h p×N_h p} are the weight matrices and b_q, b_k, b_v are the bias terms.^3 After the transformation, the three components Q, K, V are split into N_h blocks corresponding to different heads. For example, Q is re-written as Q = [Q^(1), ..., Q^(N_h)], where each Q^(h) is an n-by-p matrix, and W_q^(h), b_q^(h) are the corresponding parts in W_q, b_q. The attention output for the h-th head is then computed as:

head^(h) = (D^(h))^{-1} M^(h) V^(h),  with M^(h) = exp(Q^(h) (K^(h))^T / √p),  (2)

where D^(h) is a diagonal matrix whose entry D^(h)_ii is the sum of the i-th row in M^(h), corresponding to the normalization part in softmax.

After we obtain the outputs of each head, they are concatenated as H = [head^(1), ..., head^(N_h)], followed by the overall output HW_o + b_o, where W_o and b_o are similarly sized as the other matrices in Equation (1).

^1 In implementation, LoRA also trains the bias terms in the linear transform besides the matrices A, B; for brevity, the bias terms are omitted throughout the paper.
^2 For simplicity we omit the cross-attention module in transformer-based encoder-decoder models.
^3 To ease the notation we adopt the setting where X, Q, K, V have the same shape.
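As a concrete reference, the attention sub-layer above can be sketched in a few lines of NumPy. This is a minimal illustration with biases omitted; the function name and shapes are ours, not from an existing library:

```python
import numpy as np

def softmax_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head softmax attention (biases omitted), following Eqs. (1)-(2)."""
    n, d = X.shape
    p = d // n_heads                      # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # affine maps of Eq. (1), biases dropped
    heads = []
    for h in range(n_heads):
        sl = slice(h * p, (h + 1) * p)    # h-th block of columns
        M = np.exp(Q[:, sl] @ K[:, sl].T / np.sqrt(p))  # unnormalized scores
        D_inv = 1.0 / M.sum(axis=1, keepdims=True)      # softmax normalization
        heads.append(D_inv * M @ V[:, sl])              # Eq. (2)
    return np.concatenate(heads, axis=1) @ Wo           # concat + output projection
```

With n_heads = 1 this reduces to plain single-head softmax attention, which is the case used by the kernel-estimator view in § 3.2.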

Attention as Kernel Estimators
For each head in the attention module, we gave the expression for the attention output in Equation (2). In this subsection, we re-write attention as a kernel estimator to show the connection. In computing the attention output (of a single head), we have a length-n input sequence {x_i}_{i=1}^n (the rows in X), and accordingly we can obtain N key vectors^4 {k_j}_{j=1}^N ⊂ R^p (from the key matrix K) and query vectors {q_i}_{i=1}^n ⊂ R^p (from Q).^5 The original goal of self-attention is to obtain the representation of each input token x_i: g(x_i). By the denotation exchange q_i := x_i and f(q_i) := g(x_i), we can also understand the aforementioned self-attention module as returning the representation f(q_i) of the input query vector q_i through {k_j}_{j=1}^N, which behaves as a kernel estimator (Choromanski et al., 2020; Peng et al., 2020; Chen et al., 2021). Specifically, for a single query vector q_i, a Nadaraya-Watson kernel estimator (Wasserman, 2006, Definition 5.39) models its representation as

f(q_i) = ( Σ_{j=1}^N κ(q_i, k_j) c_j ) / ( Σ_{j=1}^N κ(q_i, k_j) ).  (4)

Here, κ(·, ·) is a kernel function, and the c_j's are the coefficients (c_j can either be a scalar or a vector in different applications) that are learned during training. In this estimator, {k_j}_{j=1}^N serve as the supporting points which help construct the representation for an input q_i.

^4 Note that N may not always equal n, such as in cross-attention (N ≠ n) or in prefix-tuning (N > n due to the prefix prepended to the key matrix) (Li and Liang, 2021).
^5 In this subsection we omit the superscript (h) for simplicity since the discussion is limited to a single head.
For the kernel function κ(x, y) = exp(⟨x, y⟩/√p), we slightly abuse notation and let κ(Q, K) represent an n-by-N empirical kernel matrix, whose element in the i-th row and the j-th column is κ(q_i, k_j). With these notations, the representation of the whole sequence Q will be

f(Q) = D^{-1} κ(Q, K) C,  (5)

where D is a diagonal matrix for the row normalization in Eq. (4), and C is an N-by-p matrix whose j-th row is c_j. Considering the correspondence between Equation (5) and the standard softmax attention in Equation (2), we can make a finer division of the attention module: the empirical kernel matrix κ(Q, K) (D is decided by κ(Q, K)) and the coefficient part C, which includes but is not limited to the value matrices in attention (see § 4). In what follows, we will discuss how we build adapters for these two parts.
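The correspondence between Equation (5) and softmax attention can be checked numerically. Below is a minimal NumPy sketch in which the exponential kernel, with C = V, reproduces single-head softmax attention (names and shapes are illustrative):

```python
import numpy as np

def nadaraya_watson(Q, K, C, kappa):
    """Nadaraya-Watson estimator of Eq. (5): f(Q) = D^{-1} kappa(Q, K) C."""
    G = kappa(Q, K)                               # n-by-N empirical kernel matrix
    return (G / G.sum(axis=1, keepdims=True)) @ C # D^{-1} is the row normalization

def exp_kernel(Q, K):
    """Exponential kernel kappa(x, y) = exp(<x, y> / sqrt(p)) from Eq. (2)."""
    return np.exp(Q @ K.T / np.sqrt(Q.shape[1]))

# With C = V, the estimator coincides with single-head softmax attention.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 4)), rng.normal(size=(7, 4)), rng.normal(size=(7, 4))
S = np.exp(Q @ K.T / np.sqrt(4))
softmax_attn = (S / S.sum(axis=1, keepdims=True)) @ V
assert np.allclose(nadaraya_watson(Q, K, V, exp_kernel), softmax_attn)
```

Note that here N = 7 ≠ n = 5, mimicking the cross-attention / prefix-tuning case from footnote 4.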

Method
We introduce our Kernel-mix method in this section, which builds upon the proposed Kernel-wise adaptation. To explain the principle behind Kernel-wise, we first present the guidelines we aim to adopt in our adapter design and show that existing methods fail to satisfy them (§ 4.1). With these details, we finally propose our method in § 4.2 and § 4.3.

Guidelines Motivated by Kernel Learning
Given the analogy between attention in PLMs and kernel estimators, we hypothesize that parameter-efficient adaptation should be aware of this connection in transformer-based PLMs and utilize desirable guidelines emerging from the literature on kernel learning. Here we discuss the guidelines introduced in § 1 in further detail. Guideline-1 suggests that the adaptation should be head-specific. Conceptually, different heads correspond to different empirical kernel matrices (distinct distributions of attention scores), and it is beneficial to adapt the attention module in a head-specific manner. The effect of head-specific adaptation is also observed in other work, e.g., by He et al. (2021), who mention that multi-head influence can make methods such as prefix-tuning more expressive.
Guideline-2 is that we should assign a larger parameter budget to the coefficient (or value) part of attention than to the empirical kernel matrix part (query and key). This guideline comes from the classical optimization procedure in kernel learning (Wasserman, 2006, Definition 5.29), where we fix the kernel in use and only perform unconstrained optimization over the coefficients (the c_j's in Equation (4)). This practice can be justified by the Representer Theorem (Schölkopf et al., 2001), which states that the minimizer f* of certain empirical risks admits a representation of the form

f*(·) = Σ_{j=1}^N α_j κ(·, k_j),

where the α_j's are the free parameters to optimize. The Representer Theorem indicates that the target estimator f*(·) is simply a linear combination of the κ(·, k_j)'s, and therefore many kernel methods focus on optimizing the coefficients α_j's. Revisiting Equation (4) under the setting of transformers, this observation motivates us to apply Guideline-2 to better model the tunable coefficient part, the c_j's, in attention.
In addition, kernel learning theory concludes that the sample efficiency of a Nadaraya-Watson kernel estimator is mainly influenced by its bandwidth (the scaling factor, corresponding to the factor 1/√p in Equation (2)) rather than the concrete form of the empirical kernel matrix κ(Q, K) (Wasserman, 2006, Section 5.4). This implies that the adaptation of the empirical kernel matrix can be conservative. This is also similar to the conjecture by He et al. (2021), who mention that "attention learns pairwise positional interactions which do not require large capacity for adapting to new tasks."

Do Existing Adapters Satisfy the Guidelines?
Adapters are designed to modify the hidden states at a certain step in a transformer, and their mechanism can be stated as

H ← H + ∆H,

where H is the "hidden state" at a certain step, and ∆H is the update given by the adapter. As shown by He et al. (2021), this formulation embodies most of the recent proposals for efficient adaptation, such as adapters, prefix-tuning, LoRA, and similar variants.
The original MLP-based adapters, which only adjust the output of a particular layer (Houlsby et al., 2019), do not modify the empirical kernel matrix in the attention sub-layer.
Prefix-tuning satisfies the first guideline as it is head-specific by nature (it prepends trainable continuous prefixes to both key and value matrices in each head). However, it fails to satisfy the second guideline as it enforces an equal assignment of tunable parameters to both the kernel and the coefficient parts (since the prefixes for key matrices and value matrices correspond to each other).
As for weight-updating adapters such as LoRA (Hu et al., 2021), the proposed setup disobeys both guidelines. The original LoRA updates the whole weight matrix, which is not head-specific. To explain this, consider the weight matrix W_q as an example.^6 In each training step, LoRA updates W_q with a low-rank matrix BA of the same size as W_q. However, if we write A = [A^(1), ..., A^(N_h)] with A^(h) ∈ R^{r×p}, the update received by head-h is BA^(h). We observe that the updates for all the heads share the same column space, spanned by B. In the extreme case of a rank-1 B (for example), the updates for every column in the weight matrix will be in the same direction, which is not ideal for adapting all the heads. Further, as suggested by Hu et al. (2021), LoRA evenly assigns the parameter budget to the weight matrices for Q and V, which deviates from the second guideline.
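The shared-column-space observation is easy to verify numerically. In the sketch below (illustrative shapes, not the actual training code), appending any head's slice of the LoRA update BA to B never raises the rank, i.e., every head's update lives in span(B):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, p, r = 4, 8, 2
d = n_heads * p

B = rng.normal(size=(d, r))   # shared by all heads
A = rng.normal(size=(r, d))

for h in range(n_heads):
    A_h = A[:, h * p:(h + 1) * p]   # block of A corresponding to head h
    delta_h = B @ A_h               # per-head slice of the LoRA update BA
    # Columns of delta_h lie in span(B): appending them does not raise the rank.
    assert np.linalg.matrix_rank(np.hstack([B, delta_h])) \
        == np.linalg.matrix_rank(B)
```

With rank-1 B the loop above makes the failure mode concrete: every head's update is a rank-1 matrix with the same column direction.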

Kernel-wise Adaptation
We choose LoRA as our primary base to develop our method because of its flexibility in assigning parameters to different weight matrices, both in the empirical kernel matrix and in the coefficient components.

Guideline-1: Head-specific Adaptation
To incorporate Guideline-1, we extend the framework of LoRA and propose Kernel-wise,^7 which satisfies the first guideline.
Here, we allow the low-rank weight matrix updates for each head to have customized column spaces, by training a distinct B^(h) ∈ R^{N_h p×r} for head-h, ∀h ∈ [N_h]. In this case, the weight matrix W^(h) in head-h is updated by

W^(h) + B^(h) A^(h),  with A^(h) ∈ R^{r×p},

and is expected to be more expressive due to the non-shared column spaces (see Figure 1(b)).
On the downside, this design suffers from inflexibility under a small parameter budget. For all the B^(h)'s to provide rank-r updates in each head, the new adaptation takes around N_h^2 pr parameters. However, if, for instance, only 4N_h p parameters are assigned to modulate a weight matrix, we can still implement the original LoRA by using rank-2 A, B, while the construction of Kernel-wise would be prohibited, since even rank-1 updates in each head would require more parameters than the budget.
To resolve the issue, we provide a lightweight alternative to the head-specific adaptation above, which we call Kernel-wise-lite. In this version, we propose to use the frozen W_k^(h) ∈ R^{N_h p×p} as the head-specific basis for head-h (W_k^(h) is the h-th block in the weight matrix W_k, c.f. § 3.1). Therefore any target weight matrix W^(h) would be updated by

W^(h) + W_k^(h) B_k^(h) A^(h),

where B_k^(h) is a p-by-r matrix and A^(h) is r-by-p (see Figure 1(c)).
Kernel-wise-lite allows the adaptation to be head-specific while containing the same number of trainable parameters as the original LoRA, which is 2N_h pr. This comes at the cost of restricting the basis space of the updates for each head from the unconstrained space to the column space of the frozen W_k^(h). But why should we choose W_k for the lightweight updates? The motivation comes from results in kernel learning that encourage adapting the coefficient part using the basis spaces spanned by the key matrices. In a kernel estimator, the coefficient C is independent of the query sequence Q, as it is trained solely on the supporting points K; asymptotically, c_j, the j-th row in C, is decided only by k_j, j ∈ [N] (Yang et al., 2017). Concretely, given the loss function and the kernel function, c_j is in general influenced by k_j and K_{-j} (all the points except k_j), while as N → ∞, K_{-j} can be fully specified by a fixed distribution. This implies that in attention, keys are more related to values than queries are, and accordingly we turn to W_k to form the basis for the low-rank updates.
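The two constructions and their parameter counts can be sketched as follows (a NumPy illustration with made-up shapes, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, p, r = 4, 8, 2
d = n_heads * p
W_k = rng.normal(size=(d, d))   # frozen pre-trained key weights

# Kernel-wise: a distinct B^(h) per head -> roughly N_h^2 * p * r parameters.
B = [rng.normal(size=(d, r)) for _ in range(n_heads)]
A = [rng.normal(size=(r, p)) for _ in range(n_heads)]
kernel_wise_params = sum(b.size + a.size for b, a in zip(B, A))

# Kernel-wise-lite: reuse the frozen W_k^(h) as the head-specific basis,
# training only a p-by-r B_k^(h) and an r-by-p A^(h) per head.
B_lite = [rng.normal(size=(p, r)) for _ in range(n_heads)]
A_lite = [rng.normal(size=(r, p)) for _ in range(n_heads)]
lite_params = sum(b.size + a.size for b, a in zip(B_lite, A_lite))

def lite_update(h):
    W_k_h = W_k[:, h * p:(h + 1) * p]    # frozen h-th block of W_k
    return W_k_h @ B_lite[h] @ A_lite[h] # head-specific rank-r update

assert kernel_wise_params == n_heads * (d * r + r * p)  # N_h(N_h p r + p r)
assert lite_params == 2 * n_heads * p * r               # matches LoRA's 2 N_h p r
assert lite_update(0).shape == (d, p)
```

With these toy sizes (N_h = 4, p = 8, r = 2), Kernel-wise needs 320 trainable parameters per weight matrix while Kernel-wise-lite needs 128, the same as rank-2 LoRA.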
Combining LoRA with Kernel-wise. Our proposed adaptation can make fine-grained adjustments to each attention head and thus improve the representations. However, the increased representation power may at times trade off against lower-rank updates. As such, we propose our main variant, Kernel-mix, which combines the original LoRA and Kernel-wise to attain the best of both worlds: a larger basis (and therefore higher-rank updates) and specific adaptation to each head. Its update expression for head-h is

W^(h) + B_LoRA A_LoRA^(h) + B^(h) A^(h),

where B_LoRA is shared among all the heads, while the B^(h)'s are head-specific. We remark that Kernel-mix also has a lightweight alternative, Kernel-mix-lite, which combines the original LoRA and Kernel-wise-lite. We compare the different variants through experiments in § 6.
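A minimal sketch of the Kernel-mix per-head update, under the same illustrative shapes as above (not the actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, p, r = 4, 8, 2
d = n_heads * p

# Shared (LoRA-style) factors and head-specific (Kernel-wise) factors.
B_shared = rng.normal(size=(d, r))
A_shared = rng.normal(size=(r, d))
B_heads = [rng.normal(size=(d, r)) for _ in range(n_heads)]
A_heads = [rng.normal(size=(r, p)) for _ in range(n_heads)]

def kernel_mix_update(h):
    """Per-head update: shared-basis term plus head-specific term."""
    A_sh = A_shared[:, h * p:(h + 1) * p]   # head-h block of the shared A
    return B_shared @ A_sh + B_heads[h] @ A_heads[h]

# Each head's update combines two rank-r bases, so it can reach rank 2r.
assert kernel_mix_update(0).shape == (d, p)
assert np.linalg.matrix_rank(kernel_mix_update(0)) == 2 * r
```

The rank check makes the "best of both worlds" claim concrete: the shared and head-specific bases jointly span a larger update space than either alone.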

Guideline-2: Making Coefficients more Expressive
To incorporate Guideline-2, we propose to make the coefficients more expressive by allowing the modification of both W_v and W_o for the coefficient part, compared to only updating W_v as in the original LoRA. We achieve this by re-writing the attention sub-layer under the kernel estimator perspective, which extends the scope of attention by including W_o in its head-specific computation as well. If we represent W_o as a stack of row blocks, W_o = [W_o^(1); ...; W_o^(N_h)] (W_o^(h) ∈ R^{p×N_h p}), then the attention sub-layer can be re-written as

Σ_{h=1}^{N_h} (D^(h))^{-1} κ(Q^(h), K^(h)) V^(h) W_o^(h) + b_o,

and we propose to take each summand as the complete form of a head (kernel estimator). We thus extend the coefficient part from the value matrices to the matrix products V^(h) W_o^(h), which naturally results in N-by-N_h p coefficients (with rank at most p).
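The re-writing above rests on the identity that concatenating heads and then projecting by W_o equals summing head-specific products head^(h) W_o^(h). A minimal NumPy check (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_heads, p = 5, 3, 4
heads = [rng.normal(size=(n, p)) for _ in range(n_heads)]  # per-head outputs
W_o = rng.normal(size=(n_heads * p, n_heads * p))

# [head^(1), ..., head^(N_h)] W_o  ==  sum_h head^(h) W_o^(h),
# where W_o^(h) is the h-th block of rows of W_o.
concat_then_project = np.concatenate(heads, axis=1) @ W_o
per_head_sum = sum(h_out @ W_o[h * p:(h + 1) * p, :]
                   for h, h_out in enumerate(heads))
assert np.allclose(concat_then_project, per_head_sum)
```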

Final Model
Combining the two pieces together, we report the concrete adaptation scheme under two settings: • With very limited parameter budgets (less than 0.2% of the total PLM parameters), similar to LoRA, we modify W q and W v with equal parameter budgets using Kernel-mix-lite(qv). (The suffix (qv) means the method will adjust W q and W v .) In this case, we omit Guideline-2 and only incorporate Guideline-1 to apply head-specific updates.
• With intermediate parameter budgets (around 1.6% of the total parameters in the PLM), we suggest using Kernel-mix(qvo), which instead modifies W_q, W_v, and W_o, assigning more budget to the coefficient part (Guideline-2). Given the increased parameter budget, we apply the Kernel-mix scheme to W_q and W_o, while continuing to use the Kernel-mix-lite scheme for W_v. Kernel-mix(qvo) incorporates both guidelines.

Experiments
While parameter-efficient tuning methods have been extensively studied for NLU tasks, their applicability to NLG tasks is not well known. This section presents empirical results of our proposed methods on three NLG tasks. To show the consistent effectiveness of our methods, we provide results on two NLU tasks as well. 8

Experimental Setup
We mainly evaluate the performance of our methods on NLG tasks, in which there is still a gap between fine-tuning and most parameter-efficient adaptation techniques. Our experiments mainly follow the setting used by Lin et al. (2020), which takes GPT-2 SMALL (124M parameters) (Wolf et al., 2019) as the backbone for all the NLG tasks. We choose the smaller model size, as compared to the larger models used in other related studies, because it is generally difficult for smaller models to attain the same performance as full-model fine-tuning (Lester et al., 2021; Liu et al., 2021b). This creates a challenging testbed for evaluating our proposed approaches.
In addition to the NLG experiments, we also study two NLU tasks, MNLI (Williams et al., 2018) and SST2 (Socher et al., 2013), to show the performance of our methods on encoder-only transformers. The Multi-Genre Natural Language Inference (MNLI) corpus contains sentence pairs of premises and hypotheses with entailment annotations, and the task is to predict whether the premise entails, contradicts, or is neutral to the hypothesis; the Stanford Sentiment Treebank (SST2) is composed of movie reviews with human-annotated sentiment, and the task is to predict whether the sentiment of a review sentence is positive or negative. We implement the backbone under the setting used by He et al. (2021) and use RoBERTa BASE (125M parameters) (Liu et al., 2019) for both MNLI and SST2.

Baselines
We compare our method with several other representative methods: fine-tuning (Howard and Ruder, 2018), adapters (Houlsby et al., 2019), prefix-tuning (Li and Liang, 2021), LoRA (Hu et al., 2021), Bitfit (Ben Zaken et al., 2021), and Compacter (Mahabadi et al., 2021). In Table 1 we use a postfix after adapters / LoRA / prefix-tuning to indicate their bottleneck size / rank of updates / prefix length, respectively. For instance, Adapter-4 means that the bottleneck size of the two-layer MLP in the inserted adapter is 4. For some adaptation techniques, the number of parameters to train is not flexible. For instance, Bitfit proposes to tune all the bias terms within the PLMs, and, as a result, the maximum number of tunable parameters is limited; as for Compacter, the weight matrices in the adapter modules are constructed through the Kronecker product, and the parameter complexity is O(Ln + n^3) (Mahabadi et al., 2021), where n is the size of a square matrix used in the Kronecker product and L is the number of layers. We choose a particular setting to make the number of trainable parameters in Compacter close to that of Bitfit and a tiny-size adapter. The parameter size of Compacter is not increased further, since a larger n would significantly slow down training.
For prefix-tuning, Li and Liang (2021) suggest utilizing a re-parametrization trick to mitigate its initialization issue; therefore, the number of parameters to train is much larger than the actual number of parameters to store, while these two numbers are the same for all other methods. In deciding the model size, we make the number of parameters to store in prefix-tuning roughly the same as its adapter counterpart by adjusting the prefix length.

Main Results
Table 1 compares our proposed methods against other baselines on the aforementioned generation tasks. The performance of our proposed Kernelmix method on text classification is reported in Table 2. We summarize our observations as follows.
Tasks with long input. As shown in Table 1, all previous parameter-efficient methods fail to attain performance comparable to fine-tuning on the CoQA and CNN/DM tasks, which have longer inputs than the table-to-text generation task WebNLG. This indicates that current parameter-efficient methods still fall behind fine-tuning on these more challenging generation tasks with longer sequences. However, Kernel-mix closes this gap and even outperforms fine-tuning on some NLG tasks, e.g., CoQA, while tuning only 1.61% of the parameters of GPT-2 SMALL.
Impact of parameter sizes. The results in Table 1 show that, overall, as the tunable parameter size increases, the performance of the various parameter-efficient methods also increases, getting closer to full fine-tuning. This indicates that a large enough parameter budget is still a prerequisite for the excellent performance of parameter-efficient adaptation methods. This finding helps explain why Compacter can be better than Adapter-4 over all the tasks when the parameter budget is tiny (Tiny Budget < 0.1% in the second group in Table 1), considering that Compacter can construct a larger MLP than the adapter with the same parameter budget thanks to the Kronecker product. This finding also makes clear a limitation of Bitfit and Compacter: their parameter budgets are constrained to be small and cannot be increased at will. As for the original LoRA, the impact of parameter size is somewhat tricky: LoRA-4 shows competitive performance, while LoRA-54 does not improve as much as other methods on WebNLG and CoQA. A similar phenomenon on different datasets is also observed by Hu et al. (2021) and He et al. (2021).

Performance of our proposed Kernel-mix(qvo).
Our proposed Kernel-mix(qvo) generally improves performance on all three NLG datasets. On WebNLG, Kernel-mix(qvo) provides a 1.1 BLEU increase over Adapter-108 and a 2.6% increase over LoRA-54; on CoQA, our method improves even more over LoRA-54, obtaining a 2.7-point gain in both exact-match and F1, and even outperforms fine-tuning by around 1% on both metrics; on CNN/DM, all the parameter-efficient methods have close performance, while a significance test shows that Kernel-mix has a significantly higher ROUGE-2 score than the best baseline, LoRA-54. Overall, Table 1 demonstrates that Kernel-mix adaptation better exploits the attention structure in PLMs and improves the overall generation quality under all three parameter budgets.

Performance on NLU tasks. Table 2 shows the performance of Kernel-mix(qvo) when it is extended to encoder-only transformers. For a fair comparison, we specify a new parameter budget (0.5%) for Kernel-mix(qvo), different from the previous settings in Table 1. With the new budget, Kernel-mix(qvo) attains accuracy close to the other parameter-efficient methods on both MNLI and SST2. We remark that the parameter budget used here is slightly tight for Kernel-mix(qvo), as the ranks assigned for head-wise adaptation (1 for W_q and 2 for W_v, W_o) are limited (see Table B.5).

Ablation Studies
Besides the main results in Table 1, we also perform ablation studies to verify the effectiveness of our propositions. We additionally implement four variants of Kernel-wise to help ablate the effects of our proposed guidelines. Among the new variants, Kernel-wise-lite(qv), Kernel-wise(mq), and Kernel-wise(mv) only adjust W_q, W_v: Kernel-wise-lite(qv) takes the strategy in LoRA-4 of evenly assigning parameters to W_q and W_v; Kernel-wise(mq) gives more budget to W_q than W_v with a ratio of 3:1, while Kernel-wise(mv) is set up the other way around. In contrast, Kernel-wise(qvo) simultaneously adjusts W_q, W_v, and W_o (with a budget ratio of 5:1:10). The experimental results are summarized in Table 3. The settings of the variants designed for ablation are described in Appendix B.4.
Head-specific adaptation (Guideline-1). For LoRA-4, the setting recommended by Hu et al. (2021), we compare against its head-specific counterpart, Kernel-wise-lite(qv). On almost all the tasks, Kernel-wise-lite(qv) attains better performance with the same number of parameters.
More parameters for the coefficient part (Guideline-2). Hu et al. (2021) (Section 7.1) have already done some preliminary exploration to find the relatively more important weight matrices in transformers. Their experimental results (copied as Table C.6 in Appendix C) clearly show that "putting all the parameters in ∆W_q or ∆W_k results in significantly lower performance". In this work, we additionally show that by simply moving some trainable parameters from Q, the empirical kernel matrix part, to V, the coefficient part, Kernel-wise(mv) improves upon Kernel-wise(mq) as well.
Extending the scope of attention. To show the benefit of adjusting both W_v and W_o, we compare the new variant Kernel-wise(mv) against Kernel-wise(qvo). Both assign more budget to the coefficient part; Kernel-wise(qvo) updates W_q, W_v, and W_o, while Kernel-wise(mv) only adjusts W_q, W_v. We observe that Kernel-wise(qvo) has better performance on most tasks.
Combining the shared and the head-specific basis. Lastly, we find that Kernel-mix(qvo) (our proposed method) outperforms Kernel-wise(qvo), which justifies combining the two types of basis, as opposed to pure head-specific adaptation.

Conclusion and Future Work
In this work, we revisit the connection between the attention module and kernel estimators, and accordingly propose kernel-wise adaptation, which adopts the guidelines from kernel learning to strengthen the low-rank adaptation (LoRA). We verify that with the same parameter budgets, our proposed adaptation techniques can have better performance on three generation tasks than the existing parameter-efficient methods, including adapters, prefix-tuning, and LoRA, and attain close accuracy on two classification tasks as well.
One possible extension of our work is combining our proposed method with other adapters in feed-forward sub-layers. In the MAM-adapter, He et al. (2021) suggest applying prefix-tuning to adapt the parameters in self-attention sub-layers and assigning budgets to feed-forward sub-layers as well; it could be beneficial to replace prefix-tuning with Kernel-mix for adaptation in the attention part.
Another direction of future research is the extension of our method to the feed-forward sub-layers, which are interpreted as key-value memories in recent work and behave like attention blocks (Geva et al., 2021). It will be interesting to study if kernel-specific guidelines could help design better adapters employed in the feed-forward layers.

A Dataset Details
• The WebNLG dataset consists of mapping sets of RDF triples to text. The training data are Data/Text pairs where the data is a set of (subject, property, object) triples. There are 9 categories extracted from DBpedia in the train and development (dev) sets, while the test set contains 5 more unseen categories, which can be used to evaluate the generalization of the adaptation methods. We adopt the official evaluation script and report BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and TER (Snover et al., 2006). 10
• CoQA is a large-scale conversational question answering dataset. It contains over 127K questions with answers collected from more than 8K conversations. The problem involves generating answers to the questions based on related conversation histories and documents. We follow the official evaluation script and use the macroaverage F1 score of word overlap as the main evaluation metric (Reddy et al., 2019).
• CNN/DM is a benchmark for text summarization, involving more than 300K news articles provided by CNN and the Daily Mail. We report ROUGE-2 scores (Lin, 2004) as evaluation metrics.
• SST2 consists of movie review sentences whose sentiment annotations are derived from fully labeled parse trees (3 human judges annotate each phrase). Wang et al. (2019) incorporated the task into the GLUE benchmark, with 67k sentences in the training set, 0.9k in the dev set, and 1.8k instances in the test set. (Only the dev set is used in Table 2.) The metric is the accuracy of deciding whether the sentiment of a review sentence is positive or negative.

For WebNLG, the learning rate we use is 0.00125, and the batch size is 16; for CoQA, the learning rate is 0.005, and the batch size is 8; for CNN/DM, the learning rate is 0.001, and the batch size is 16; for MNLI, the learning rate is 0.0002, and the batch size is 32; for SST2, the learning rate is 0.0001, and the batch size is 16. We train each variant for multiple independent runs to account for variability: 5 runs for WebNLG, 3 for CoQA, 2 for CNN/DM, 3 for MNLI, and 3 for SST2. 12 The reported numbers in Tables 1 and 2 are the mean values averaged over the runs.

B.3 Implementation and Training Efficiency
All the models in this work are implemented in PyTorch. For devices, we perform distributed training using 8 Tesla V100 16GB GPUs. On WebNLG, our method takes around 1 minute per epoch; on CoQA, around 20 minutes per epoch; on CNN/DM, around 30 minutes per epoch. For the NLU tasks, the implementation by He et al. (2021) cannot be adapted to distributed training (it can only be trained with one graphics card), and thus the training time is longer: our method takes around 2 hours per epoch on MNLI and 25 minutes on SST2.
We additionally note an implementation trick for Kernel-wise-lite. We can associate the h-th head's key matrix K^(h) with the computation of any weight matrix, say for Q^(h), as follows (biases omitted):

Q^(h) = X(W_q^(h) + W_k^(h) B_k^(h) A^(h)) = XW_q^(h) + K^(h) B_k^(h) A^(h).

In that case we can reuse the already-computed K^(h) to save the computation of the product X(W_k^(h) B_k^(h) A^(h)).

12 We reduce the number of runs for larger datasets given the computation budget.
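The trick can be checked numerically. In this NumPy sketch (illustrative shapes, biases omitted), reusing the cached K^(h) gives exactly the same adapted Q^(h) as recomputing the product from X:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_heads, p, r = 5, 4, 8, 2
d = n_heads * p
X = rng.normal(size=(n, d))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
B_k, A = rng.normal(size=(p, r)), rng.normal(size=(r, p))  # trainable factors

h = 0
W_q_h = W_q[:, h * p:(h + 1) * p]
W_k_h = W_k[:, h * p:(h + 1) * p]
K_h = X @ W_k_h                          # already computed for attention anyway

naive = X @ (W_q_h + W_k_h @ B_k @ A)    # recompute X W_k^(h) from scratch ...
reused = X @ W_q_h + K_h @ B_k @ A       # ... or reuse the cached K^(h)
assert np.allclose(naive, reused)
```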

B.4 Specific settings for each method
We report the exact settings for the methods that need further explanation in this subsection. For Compacter, the bottleneck size of the adapter is 192, and the number of components is 4, as suggested by Karimi Mahabadi et al. (2021); for the original LoRA and the variants of our proposed methods, we summarize the settings in Table B.5. In this table, the numbers in columns Q_wise, V_wise, and O_wise are the ranks used for Kernel-wise; if a number is followed by "(lite)", we apply Kernel-wise-lite with the listed rank to adjust the corresponding weight matrices. The numbers in columns Q_LoRA, V_LoRA, and O_LoRA are the ranks of the updates used as in the original LoRA. For Kernel-mix methods, the numbers in both columns Q_wise and Q_LoRA (for example) will be nonzero.

C Partial Experimental Results Reported in LoRA (Hu et al., 2021)

For ease of reading, we copy Table 5 from Hu et al. (2021) as a piece of evidence to show that "putting all the parameters in ∆W_q or ∆W_k results in significantly lower performance".