Hyperdecoders: Instance-specific decoders for multi-task NLP

We investigate input-conditioned hypernetworks for multi-tasking in NLP, generating parameter-efficient adaptations for a decoder using a hypernetwork conditioned on the output of an encoder. This approach produces a unique decoder adaptation for every input instance, allowing the network a larger degree of flexibility than prior work that only produces one decoder adaptation per task. We apply our method to sequence classification tasks, extractive QA, and summarisation and find that it surpasses previous parameter-efficient fine-tuning methods and often outperforms fully finetuning the underlying model. An analysis of the embeddings used by our hypernetwork shows that they are sensitive to output label and type, suggesting that our approach better maps from encoder representations to output labels. Our code is publicly available at https://github.com/allenai/hyperdecoders.


Introduction
Recent work in NLP has examined the performance of large pretrained transformer-based models in multi-task settings, where a single model is evaluated on multiple tasks simultaneously, often with tasks converted to a shared sequence-to-sequence format (Raffel et al., 2020; Brown et al., 2020). This greatly simplifies model training and deployment, requiring only one deployed model and format to handle multiple tasks. Additionally, training across multiple tasks can result in greatly improved performance for similar tasks (Phang et al., 2018), as well as tasks not seen during training (Sanh et al., 2022; Wei et al., 2022). However, not all tasks work well together, and jointly training on certain task pairs can reduce performance on both ('negative transfer') (Aribandi et al., 2022).
In this paper, we propose a new method for multi-task NLP using a hypernetwork to generate an instance-specific decoder from the output of an encoder. We effectively explore whether a model can learn to adapt itself by learning to generate adapter layers. Our approach produces significant gains over prior approaches for efficient multi-task fine-tuning, often matching or exceeding fully fine-tuning the underlying model. We build on parameter-efficient learning methods, where one trains a small set of parameters within a much larger model. These parameters may be newly introduced (Houlsby et al., 2019; Li and Liang, 2021) or already exist within the model (Zaken et al., 2021), and are kept as few as possible. This means these methods often do not have the capacity to learn multiple tasks at once, losing potential transfer benefits. One way to remain parameter-efficient while still handling a variety of tasks is to instead learn to generate these parameters, making use of an auxiliary network ('hypernetwork') to generate the weights used during inference (Tay et al., 2021; Pilault et al., 2021; Ye and Ren, 2021; Karimi Mahabadi et al., 2021). This allows the model to benefit from positive transfer between tasks through the shared hypernetwork while reducing negative transfer by allowing the generated parameters to be unique per task. However, hypernetwork-based approaches generally condition their parameters on a learnt task embedding, meaning (a) the model is the same for every example within a task, and (b) adapting to new tasks requires further training to learn new task embeddings. We increase the flexibility of this approach by instead generating unique parameters for every input, allowing the model to make use of similarities between samples across datasets and avoid potential interference between samples within the same dataset. This is achieved by using a shared encoder across all tasks and then generating adapter layers for the decoder only by feeding the encoded
inputs into a hypernetwork. Furthermore, by conditioning on inputs rather than task embeddings, our approach allows simple transfer to out-of-domain data, as the shared hypernetwork and encoder learn to map from text to parameters. Figure 1 illustrates our approach.
We apply our approach to a diverse set of tasks, including sequence classification, extractive question answering, and summarisation, and find that our approach outperforms existing parameter-efficient approaches and matches or outperforms full finetuning. An analysis of our approach shows that sharing parameters in the encoder but generating them in the decoder is more effective than other possible setups, suggesting that the encoder benefits from multi-task training while the decoder does not. Our results suggest that pretrained encoders can be easily adapted and trained to produce adaptations that enhance multi-task transfer learning for decoders, while producing useful adaptations for encoders is much more difficult.
To summarise, our core contributions are:
1. We propose a new method for parameter-efficient multi-tasking, generating unique decoder layers for every input into a model.
2. We show that our approach performs strongly against other parameter-efficient baselines and fully finetuning the underlying model across a diverse set of NLP tasks.
3. We show the embeddings learnt by our hypernetwork are sensitive to both dataset and output label, suggesting the hypernetwork is effectively controlling the decoder.

Encoder-conditioned Decoders
We make use of T5 as the underlying model in our experiments, which is a popular encoder-decoder model for sequence-to-sequence multi-tasking (Raffel et al., 2020) and a common starting point for previous hypernetwork-based approaches (Karimi Mahabadi et al., 2021; Tay et al., 2021). However, our overall approach can be applied to generic encoder-decoder transformer models.

Adapter Layers
We augment our underlying model with adapter layers (Houlsby et al., 2019). These are small bottleneck networks of the form

Adapter(x) = W_u f(W_d x + b_d) + b_u,

where f is the ReLU activation function. We insert these layers in parallel with the feedforward module of each layer of a larger pretrained model, following He et al. (2022a):

y = FF(LayerNorm(x)) + Adapter(x) + x,

where y is the output for the layer, FF is the feedforward module, and x is the output from the attention module(s). We find using LayerNorm(x) as input to the adapter less effective than directly using x (see section 5.2). During training, the underlying network is frozen and only W_u, W_d, b_u, and b_d are updated.
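As an illustrative sketch (not the actual T5 or adapter implementation, with toy sizes and an identity stand-in for the frozen feedforward module), the parallel adapter insertion above can be written in NumPy as:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_adapter = 16, 4  # illustrative sizes, far smaller than T5's

# Adapter parameters: bottleneck down-projection and up-projection.
W_d = rng.normal(0, 0.02, (d_model, d_adapter))
b_d = np.zeros(d_adapter)
W_u = rng.normal(0, 0.02, (d_adapter, d_model))
b_u = np.zeros(d_model)

def relu(x):
    return np.maximum(x, 0.0)

def adapter(x):
    # Adapter(x) = W_u f(W_d x + b_d) + b_u, with f = ReLU
    return relu(x @ W_d + b_d) @ W_u + b_u

def layer_norm(x, eps=1e-6):
    # simple mean/variance normalisation as a stand-in for T5's LayerNorm
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x):
    return x  # identity stand-in for the frozen pretrained FF module

def transformer_ff_block(x):
    # parallel adapter: y = FF(LayerNorm(x)) + Adapter(x) + x,
    # feeding the raw x (not LayerNorm(x)) into the adapter
    return feed_forward(layer_norm(x)) + adapter(x) + x

x = rng.normal(size=(3, d_model))  # 3 token representations
y = transformer_ff_block(x)
assert y.shape == x.shape
```

Note the adapter output keeps the model dimension, so it can be summed directly with the feedforward output and the residual stream.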

Adapter-generating Hypernetworks
A hypernetwork is a network that produces the parameters used for another network (Ha et al., 2017; Schmidhuber, 1991). We use a simple two-layer network to produce the adapter parameters W_d, W_u, b_d, b_u. Given some input x to the hypernetwork, we first project it into a bottleneck representation h = f(W_h x + b_h); a final set of linear heads then produces each of W_d, b_d, W_u, and b_u from h (with the weight outputs reshaped into matrices). We re-use this hypernetwork to generate the adapters for every layer by (partially) conditioning the input on layer embeddings, greatly improving the parameter efficiency of this approach. The hypernetwork parameters are initialised using the method proposed in Chang et al. (2020).
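A minimal NumPy sketch of such an adapter-generating hypernetwork, with hypothetical sizes; here a single fused output layer stands in for the separate output heads (mathematically equivalent to their concatenation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_adapter, d_bottleneck, d_embed = 16, 4, 8, 10  # toy sizes

# number of scalars needed for one adapter: W_d, b_d, W_u, b_u
n_params = d_model * d_adapter + d_adapter + d_adapter * d_model + d_model
H1 = rng.normal(0, 0.02, (d_embed, d_bottleneck))   # first hypernet layer
H2 = rng.normal(0, 0.02, (d_bottleneck, n_params))  # fused output layer

def hypernet(e):
    h = np.maximum(e @ H1, 0.0)  # bottleneck + ReLU
    flat = h @ H2                # emit all adapter parameters at once
    i = 0
    W_d = flat[i:i + d_model * d_adapter].reshape(d_model, d_adapter)
    i += d_model * d_adapter
    b_d = flat[i:i + d_adapter]
    i += d_adapter
    W_u = flat[i:i + d_adapter * d_model].reshape(d_adapter, d_model)
    i += d_adapter * d_model
    b_u = flat[i:]
    return W_d, b_d, W_u, b_u

e = rng.normal(size=d_embed)  # hypernetwork input (embedding + layer info)
W_d, b_d, W_u, b_u = hypernet(e)
assert W_d.shape == (d_model, d_adapter) and W_u.shape == (d_adapter, d_model)
```

Because the output layer dominates the parameter count, sharing one hypernetwork across all layers amortises this cost over the whole decoder.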

Encoder-conditioned Decoders
The core idea of our approach is to condition a hypernetwork on the output of the encoder to generate the adapters used for the decoder in an encoder-decoder model. We place regular (non-generated) adapter layers in the encoder to allow it to adapt to tasks in a parameter-efficient way. We then feed the encoder output to a hypernetwork to generate custom decoder adapters for every input to the encoder. This allows our approach to flexibly differentiate between samples within a task, rather than remaining static per task. This potentially allows more flexible transfer learning both between samples within datasets and with samples across datasets. We explore other potential configurations in section 5.2 and find ours works best overall. We name this approach 'hyperdecoder' in the following experiments.
Concretely, given some input, we first pass it through the T5 encoder to construct a hidden representation h. The T5 encoder is equipped with adapter layers, which are trained during fine-tuning while the rest of the encoder is kept frozen. We mean-pool this representation and pass it through a two-layer network with a ReLU activation to construct a vector embedding of the input:

e = W_2 ReLU(W_1 h̄ + b_1) + b_2,

where h̄ is the mean of h over token positions. We then generate the parameters for an adapter in each layer i of the decoder by concatenating a learnt layer embedding l_i to the embedding and passing it through the hypernetwork described in section 2.2.
All parameters in the decoder are frozen and the hypernetwork is trained along with the encoder adapters during fine-tuning. This approach is summarised in Figure 1.
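The overall pipeline (mean-pool the encoder output, embed it with an MLP, then generate one adapter per decoder layer from a shared hypernetwork) can be sketched as follows; the hypernetwork body is stubbed out and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_embed, d_layer, n_layers = 16, 8, 4, 3  # toy sizes

# two-layer MLP with ReLU, mapping mean-pooled encoder output -> embedding e
M1 = rng.normal(0, 0.02, (d_model, d_embed))
M2 = rng.normal(0, 0.02, (d_embed, d_embed))
# learnt per-layer embeddings l_i for the decoder layers
layer_embeds = rng.normal(0, 0.02, (n_layers, d_layer))

def instance_embedding(encoder_out):
    pooled = encoder_out.mean(axis=0)          # mean-pool over tokens
    return np.maximum(pooled @ M1, 0.0) @ M2   # the embedding 'e'

def hypernet(x):
    # stub for the shared two-layer hypernetwork of section 2.2;
    # returns a flat parameter vector of fixed size for illustration
    return x.sum() * np.ones(10)

encoder_out = rng.normal(size=(20, d_model))   # 20 encoded tokens
e = instance_embedding(encoder_out)

# one set of generated decoder-adapter parameters per layer, from the
# SAME hypernetwork, conditioned on the concatenation [e; l_i]
per_layer_params = [hypernet(np.concatenate([e, layer_embeds[i]]))
                    for i in range(n_layers)]
assert len(per_layer_params) == n_layers
```

Since e depends on the input instance, every example receives its own decoder adapters, while the encoder adapters and hypernetwork remain shared across tasks.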

Multi-tasking
We focus on a multi-task setup, where a model is trained on multiple tasks simultaneously. The underlying model parameters θ are kept unchanged during training and only the adapter and hypernetwork parameters θ′ are updated. All parameters are shared between all tasks, and the model is trained in a seq2seq setting with cross-entropy loss, following Raffel et al. (2020).

Experiments
This section details our experimental setup and baselines. Results are given in Section 4.

Datasets
We evaluate our approach in three settings: GLUE (Wang et al., 2018), the 2019 MRQA shared task (Fisch et al., 2019), and a set of summarisation and NLI datasets. For each setting, we sample from sub-tasks proportionally to their size, as initial experiments showed more complex sampling techniques provided little to no benefit. All tasks are in English.
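Size-proportional sampling can be implemented by weighting each task by its training-set size. A minimal sketch (the sizes below are the usual GLUE training-split sizes, used here only for illustration):

```python
import random

# illustrative dataset sizes; sampling probability is proportional to size
sizes = {"mnli": 392_702, "qqp": 363_846, "sst2": 67_349, "rte": 2_490}
total = sum(sizes.values())
probs = {name: n / total for name, n in sizes.items()}

random.seed(0)
names = list(sizes)
weights = [probs[n] for n in names]
# draw the source task for each example in a batch of 128
batch_tasks = random.choices(names, weights=weights, k=128)
assert abs(sum(probs.values()) - 1.0) < 1e-9
```

Larger datasets therefore dominate each batch in expectation, without any temperature scaling or capping.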

GLUE
GLUE (Wang et al., 2018) is a set of sequence classification tasks including paraphrase detection (QQP and MRPC; Dolan and Brockett, 2005), semantic similarity (STS-B; Agirre et al., 2007), natural language inference (MNLI; Williams et al., 2018; QNLI; Rajpurkar et al., 2016; RTE; Dagan et al., 2006; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), linguistic acceptability (CoLA; Warstadt et al., 2018), and sentiment classification (SST-2; Socher et al., 2013). Following Karimi Mahabadi et al. (2021), for datasets with small training sets (RTE, MRPC, STS-B, CoLA) we split the validation set in half into test and validation sets; for other, larger datasets we split out 1000 examples to use as the validation set and use the original validation set as a test set; and for MNLI we use the mismatched validation set as the test set. Following prior work, we do not evaluate on WNLI. We preprocess the GLUE inputs to follow the format used by Raffel et al. (2020) (including task prefix).

MRQA
The MRQA 2019 shared task dataset (Fisch et al., 2019) is a collection of 12 QA datasets, all modified to the same extractive QA format. Six datasets are used for training and evaluation: HotpotQA (Yang et al., 2018), Natural Questions (Kwiatkowski et al., 2019), NewsQA (Trischler et al., 2017), SQuAD (Rajpurkar et al., 2016), SearchQA (Dunn et al., 2017), and TriviaQA (Joshi et al., 2017). Another six are used for out-of-domain evaluation: BioASQ (Tsatsaronis et al., 2015), DROP (Dua et al., 2019), DuoRC (Saha et al., 2018), RACE (Lai et al., 2017), RelationExtraction (Levy et al., 2017), and TextbookQA (Kembhavi et al., 2017). This effectively tests a model's ability to generalise to out-of-domain data. We evaluate on the validation split of all 12 datasets after all training steps are complete. We preprocess all MRQA data to follow the SQuAD template used by Raffel et al. (2020). We split contexts into chunks of length 512 tokens with an overlap of 128 tokens. Chunks containing the answer are paired with that answer, and chunks without the answer are paired with the empty string.
At evaluation time, we produce an answer for all chunks and take the most likely non-empty string as the final answer.
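A sketch of the chunking scheme, assuming pre-tokenised inputs and exact-match answer detection (the real preprocessing operates on SQuAD-formatted text, so names and details here are illustrative):

```python
def chunk_context(tokens, answer_tokens, chunk_len=512, overlap=128):
    """Split a tokenised context into overlapping chunks; pair each chunk
    with the answer if it appears inside, else with an empty answer."""
    stride = chunk_len - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunk = tokens[start:start + chunk_len]
        has_answer = any(
            chunk[i:i + len(answer_tokens)] == answer_tokens
            for i in range(len(chunk) - len(answer_tokens) + 1)
        )
        chunks.append((chunk, answer_tokens if has_answer else []))
    return chunks

tokens = list(range(1000))   # stand-in token ids for a long context
answer = [600, 601, 602]
chunks = chunk_context(tokens, answer)
assert any(ans == answer for _, ans in chunks)
assert all(len(c) <= 512 for c, _ in chunks)
```

At evaluation, each chunk is decoded independently and the most likely non-empty prediction is kept, matching the procedure described above.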

Summarisation and NLI
To investigate transfer learning with difficult tasks, we evaluate on summarisation and NLI tasks that are known to cause negative interference (Aribandi et al., 2022) and are difficult for current parameter-efficient techniques (He et al., 2022a). For summarisation, we use XSum (Narayan et al., 2018), CNN/Daily Mail (Hermann et al., 2015), and the English WikiLingua split from Gehrmann et al. (2021). For NLI, we use MNLI (Williams et al., 2018), abductive NLI (Bhagavatula et al., 2020), and adversarial NLI (Nie et al., 2020). We jointly train and evaluate on all tasks. We preprocess all datasets following the templates used in Raffel et al. (2020). We evaluate on the provided test splits for all datasets except abductive NLI and MNLI. For abductive NLI, we split 1000 samples from the validation set and use the existing validation set as the test set. We treat MNLI as detailed in section 3.1.1.

Experimental Details
We build on the Hyperformer codebase (Karimi Mahabadi et al., 2021), making use of the transformers implementation of T5 (Wolf et al., 2020). We finetune all models using AdamW (Loshchilov and Hutter, 2019) with a learning rate of 3e-4, linear decay, and 500 warmup steps. For GLUE tasks, we train for 65k steps with an effective batch size of 128, evaluate every 1000 steps on the development set, and test on the overall best-performing checkpoint. For MRQA, we train for 4 epochs and evaluate on the final model (initial experiments showed that the best-performing checkpoint was always the final one regardless). For summarisation and NLI tasks, we train for 100k steps with a batch size of 64, evaluate every 5000 steps on the development set, and test using the single overall best-performing checkpoint. All non-hypernetwork and non-adapter parameters are frozen throughout training. All experiments start from the T5 v1.1 + LM checkpoints (Lester et al., 2021) unless otherwise stated. Further details can be found in Appendix B.
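The learning-rate schedule above (linear warmup over 500 steps to 3e-4, then linear decay) can be written as a small helper; step counts here use the GLUE setting:

```python
def lr_at_step(step, total_steps=65_000, warmup=500, peak=3e-4):
    """Linear warmup to `peak` over `warmup` steps, then linear decay to 0."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup))

assert lr_at_step(0) == 0.0                     # start of warmup
assert abs(lr_at_step(500) - 3e-4) < 1e-12      # peak at end of warmup
assert lr_at_step(65_000) == 0.0                # fully decayed
```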

Baselines
We primarily compare against three strong baselines: fully finetuning the underlying model, training only adapter layers placed in parallel with all feedforward modules ('adapter'), and using task-conditioned hypernetworks to generate adapter layers in the encoder and decoder ('task hypernet'), similar to the Hyperformer (Karimi Mahabadi et al., 2021). Apart from full finetuning, we keep the number of trainable parameters roughly equal across methods. More details are provided in Appendix A.
We additionally compare against relevant prior work. For GLUE, we compare against the Hyperformer and Hyperformer++ models proposed in Karimi Mahabadi et al. (2021) with an increased number of trainable parameters, and the modular task hypernetwork (Ponti et al., 2022). For MRQA, we compare against CA-MTL (Pilault et al., 2021), a BERT model with several task-conditional modules and a novel sampling technique. We also compare to UnifiedQA (Khashabi et al., 2020) results reported by Friedman et al. (2021) to show out-of-domain split performance from an alternate T5-based QA model.

Results
This section details our main experimental results.
Ablations and analysis follow in Section 5.

GLUE
We report our results on the GLUE benchmark in Tables 1 and 3. Our approach improves greatly over the Hyperformer and full finetuning when using T5 v1.1 + LM as the underlying model. Notable improvements are made in the SST-2 and RTE tasks, the latter of which is known to benefit from transfer learning (Phang et al., 2018). This suggests our approach enhances positive transfer benefits over adapter-only and full finetuning approaches.
The worst-performing approach is the 'task hypernet' method, which follows the Hyperformer approach but with our adapter placement. This suggests that the adapter placement difference between our method and the Hyperformer is not the reason for our improved performance, but rather the use of an encoder-conditioned decoder. Additionally, our approach remains parameter efficient, training only 0.03× the parameters of full finetuning. Our approach also outperforms HyperPrompt (He et al., 2022b) when using a matching evaluation setup, as seen in Table 2.
He et al. (2022b) do not release their code, so we are limited to comparing against the numbers they report.
However, we note that in Table 3 our approach underperforms when using the original T5 model as the underlying model. As T5 was originally pretrained with a mix of self-supervised span-infilling and supervised tasks (including GLUE), this suggests the Hyperformer is able to effectively adapt the model to tasks seen, or similar to those seen, during pretraining. However, when we remove these tasks from the pretraining mixture (as was done for T5 v1.1 + LM), the Hyperformer struggles to adapt the model, as seen in Table 1. Overall, this suggests exposure to the underlying task during pretraining can make a large difference in the evaluation of parameter-efficient methods. As we wish to evaluate how well our approach can adapt models to completely unseen tasks, we restrict our underlying model to T5 v1.1 + LM.

MRQA
We report our results on the MRQA dataset in Tables 4 and 5. We note that during experimentation, we found it beneficial to increase the encoder adapter size and reduce the decoder adapter size, keeping the overall parameter budget roughly the same (full results detailed in Appendix G). Overall, our approach outperforms other parameter-efficient approaches and performs comparably to full finetuning (69.4 vs 69.9 overall average F1 between our method and full finetuning). Gains are especially large in the out-of-domain test sets, likely due to the fact that freezing the underlying model preserves knowledge useful to these new domains. We also note our method performs especially well on long-context out-of-domain datasets (DuoRC, TextbookQA), suggesting it is especially effective at identifying when the question cannot be answered given a subsection of context. This shows our approach is still able to work well even when the underlying datasets are not revealed to the model in the prompt, and that it can generalise well to out-of-domain data. Additionally, our approach matches CA-MTL, which uses an underlying model with approximately 90 million extra parameters (BERT-large vs T5-base) and makes use of additional training data. Overall, this suggests our approach is able to generalise well to out-of-domain data, even when underlying datasets are not distinguished.

Summarisation & NLI
Finally, to explore a setting where strong negative interference is present, we experiment on a combination of summarisation and NLI tasks following Aribandi et al. (2022). As shown in Table 6, we find that while no approach is able to match full finetuning in summarisation, both regular adapters and our approach are able to perform well for NLI. This suggests that while our approach is able to avoid some of the negative interference that results in lower NLI scores for the fully-finetuned model, it still struggles to overcome it. We note that, similar to MRQA, we found here that placing a larger parameter budget into the encoder outperformed evenly splitting the budget between encoder and decoder (see Appendix H for details).

Analysis
This section presents an analysis of the Hyperdecoder's embeddings and ablations, establishing the efficacy of our architecture design choices.

Hypernetwork Embeddings
We visualise the embeddings learnt by the encoder before being passed to the hypernetwork ('e' in Equation 8) by utilising dimensionality reduction with t-SNE (van der Maaten and Hinton, 2008) and PCA (Pearson, 1901). Although Figure 2 suggests the hypernetwork does customise to different datasets to a degree, Figure 3 shows the most salient difference between samples is the predicted label. Note that following Karimi Mahabadi et al. (2021) we train our models to output numeric rather than text labels, meaning that while the labels may have semantic differences between datasets, the actual output from the decoder is identical between datasets. This suggests the hypernetwork has learnt to map from embedding space to the text labels, and the encoder is doing much of the classification work. We further investigate this by training simple linear classifiers for all datasets apart from STS-B on top of the T5 encoder (with learnt adapters) in Table 7. We note that we can recover much of the performance of our model in this case, suggesting that the decoder is largely working to map from representation space to text-label space. Furthermore, the efficacy of our model on long-context out-of-domain datasets for MRQA further suggests that the hypernetwork can effectively control the decoder to output specific labels when needed, but can flexibly swap to generating arbitrary text output when needed, unlike a simple linear classifier. This is especially important for multi-tasking and long-context documents, where the model must swap between generating short set labels and arbitrary longer text. A visualisation of the hypernetwork embeddings generated for NewsQA in Figure 4 further shows that empty-string answers are generally clustered together.

Ablations
Placement of adapter and parameter generators We investigate alternate adapter generation possibilities by varying regular, task-conditioned, and encoder-conditioned adapters independently in the encoder and decoder while keeping the total number of trainable parameters roughly constant. We use the same setup as our previous GLUE experiments.
As seen in Table 8, input-conditioned or task-conditioned adapters do not perform as well as regular adapters in the encoder. The task hypernetwork struggles to learn useful adapters at all, while the encoder-conditioned adapters perform better but still do not match directly learnt adapters, likely due to the regular adapters being more effectively able to share knowledge or simply being easier to optimise. However, it seems much easier to learn to generate adapters for the decoder, with both task- and encoder-conditioned adapters performing well. Our approach, using regular adapters in the encoder and encoder-conditioned adapters in the decoder, performs best overall.

Other Elements We also investigate removing the MLP used in Equation 8 and using layer normalisation outputs as input to the adapters ('post-layernorm input'). The results in Table 9 suggest the MLP provides some utility and that using inputs pre-layer normalisation works better.
Related Work

Multi-task Models
Neural networks have long been known to uncover and make use of task relatedness when training across multiple tasks (Caruana, 1997). Early applications of this approach in NLP usually involved generating shared representations and passing these to task-specific layers (Collobert and Weston, 2008; Liu et al., 2019). Newer approaches have opted to use a single set of parameters for all tasks, achieved by casting them to unified formats (Raffel et al., 2020). This allows massive multi-tasking approaches where extremely large models are trained across a wide variety of tasks (Aghajanyan et al., 2021; Aribandi et al., 2022), often with benefits to few- or zero-shot performance (Sanh et al., 2022; Wei et al., 2022). However, this requires finetuning large models for long periods over a large number of tasks, which may be out of reach with a limited compute budget.

Parameter-efficient Tuning
Numerous parameter-efficient approaches to finetuning large models have been proposed, including adapters (Houlsby et al., 2019), prefix-tuning (Li and Liang, 2021), prompt-tuning (Lester et al., 2021), and p-tuning (Liu et al., 2021a,b), all of which involve learning a small set of parameters in carefully chosen locations to achieve performance close to fully finetuning the model. Recent studies have shown the effectiveness of these methods can be increased with differing placements (Pfeiffer et al., 2021; He et al., 2022a) and that parameters learnt for one task or language can be combined to allow better performance across a wide variety of tasks or languages (Pfeiffer et al., 2020, 2021).
Outside of sequence-to-sequence-based approaches, Pilault et al. (2021) find modifying multiple parts of BERT to be conditional on a task embedding to be effective for sequence classification tasks. Üstün et al. (2020) and Ansell et al. (2021) also explore generating multilingual adapters for mBERT using linguistic typological features, the former using parameter generators in every layer.

Conclusion
We propose a novel method for generating adapters conditioned on a model's input and show that this improves performance in multi-task settings across a variety of tasks. We explore the effectiveness of our approach for sequence classification, QA, and summarisation tasks, and find that it often outperforms strong parameter-efficient baselines. An analysis of our approach suggests the primary benefits come from improved control of the encoder over the decoder, enhancing the effects of positive transfer from the shared encoder. This allows our approach to efficiently adapt a pretrained language model to multiple tasks unseen during pretraining while still benefiting strongly from positive transfer. Future work could examine applying our approach to other architectures (e.g. decoder-only models) or explore the tradeoffs between shared and generated parameters across different layers.

Limitations
Our work explores a novel idea within parameter-efficient finetuning by conditioning a model on itself, enabling greater flexibility for multi-tasking while adding relatively few parameters. However, this flexibility has limits: some tasks are more difficult to adapt with our method than others, as seen in the summarisation results in section 4.3. Examining other parameter-efficient training methods such as LoRA (Hu et al., 2022) or prefix-tuning (Li and Liang, 2021) on top of adapters may yield further improvements (He et al., 2022a), but may also uncover a dependence on the particular adapter setup used by our approach. Additionally, parts of our design involve compressing information (in particular, the mean-pooling used to condition the hypernetwork), and further experiments on tasks with long inputs or outputs (such as summarisation) may reveal potential limitations of this approach and suggest further improvements. Additionally, our work only examines English-based tasks, so the ability of the model to handle alternate or potentially multiple languages at once is unknown. We note that existing work has shown adapters and hypernetworks to be useful for multilingual adaptation (Pfeiffer et al., 2020; Üstün et al., 2020; Ansell et al., 2021), suggesting that our approach may be effective for multilingual multi-tasking when combined with these existing multilingual adaptation methods. Finally, the nature of the hypernetwork approach means that it may be difficult to scale to massive multi-tasking on the scale of ExT5 (Aribandi et al., 2022) or T0 (Sanh et al., 2022), as the hypernetwork itself has limited capacity to store tasks. We do not investigate how this approach scales with respect to tasks, instead focussing on how the hypernetwork improves positive transfer and mitigates negative transfer for tasks whose positive and negative transfer effects are well-known.

A Baseline Details
A.1 Single Adapter

Our single adapter baseline involves placing parallel adapters (He et al., 2022a) in each layer of the encoder and decoder and directly learning the parameters. For most tasks, we make use of adapters with size 410 in order to keep the trainable parameter budget equivalent to our approach. For MRQA, we achieve better results with adapters of size 800 in the encoder and size 36 in the decoder.

A.2 Task Hypernet
This model uses two hypernetworks to produce adapters for the encoder and decoder respectively, based on a learnt task embedding. The hypernetwork layout and adapter placement are identical to those proposed in our method, with the exception that the initial input is a task embedding and not the mean-pooled encoder output. The inclusion of this baseline shows that the benefits of our method come from the use of input-specific decoders rather than improved adapter placement. For this approach, we use adapters of size 50 and hypernetworks of size 100, keeping the overall number of trainable parameters close to our approach.

A.3 Modular Task Hypernetwork
This is the method proposed by Ponti et al. (2022), which learns a mapping from task id to a sparse set of skills, each of which corresponds to a set of LoRA (Hu et al., 2022) parameters. These parameters are then combined to create task-specific adaptations. We set the number of skills |S| = 4 and use a two-speed learning rate as suggested by the authors. We set the main learning rate to 3e-4, and the secondary learning rate for Z to 1e-2. We use the same learning rate schedule as detailed in section 3.2. The rank of the LoRA adaptations is set to 16, following the defaults used in Ponti et al. (2022).

B Training Details
We run experiments on a single NVIDIA A100 80GB GPU, with all reported results from one training run using the provided hyperparameters. A single training run (including preprocessing and all evaluation) of our approach on the GLUE benchmark (with T5 large v1.1 + LM as the underlying model) takes 18 hours. A single training run of our approach on the MRQA datasets (with T5 base v1.1 + LM as the underlying model), including final evaluation, takes 13 hours. A single training run of our model on the summarisation and NLI datasets (with T5 base v1.1 + LM as the underlying model), including evaluation, takes 22.5 hours. We note that T5 base v1.0 (or 'vanilla') has roughly 220 million parameters in total, T5 base v1.1 + LM has roughly 250 million, and T5 large v1.1 + LM has roughly 800 million.

C Encoder-Only Model Details
The encoder-only models used in Table 7 consist of a T5 encoder with adapters inserted as done for the adapter-only model and our model. We train a unique linear layer per task that maps from the hypernetwork embedding to logits, which are passed through a softmax layer to make a final prediction. We initialise the T5 encoder with the trained encoder adapter layers, using the checkpoints reported in Table 1. This encoder is kept frozen during training. To match the MLP used to generate the hypernetwork embeddings in Equation 8, we place an identical MLP on top of the adapter-augmented T5 encoder and pass its output to the linear classifier. This MLP is trained along with the classifier. To train, we use the AdamW optimiser with a learning rate of 2e-5 for 3 epochs, with linear learning rate warmup and decay and 500 warmup steps.
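A sketch of the per-task linear classification heads (toy sizes and hypothetical label counts; the real heads sit on top of the adapter-augmented T5 encoder's MLP output):

```python
import numpy as np

rng = np.random.default_rng(0)
d_embed = 8
n_classes = {"mnli": 3, "sst2": 2, "rte": 2}  # illustrative tasks/label counts

# one linear head per task on top of the (frozen) encoder embedding
heads = {task: (rng.normal(0, 0.02, (d_embed, c)), np.zeros(c))
         for task, c in n_classes.items()}

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(task, e):
    W, b = heads[task]
    return softmax(e @ W + b)

e = rng.normal(size=d_embed)  # stand-in hypernetwork embedding
p = predict("mnli", e)
assert p.shape == (3,) and abs(p.sum() - 1.0) < 1e-9
```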

D Parameters
We calculate the number of parameters for each method as follows. In each case, l is the number of layers, d the hidden size of the model, a the adapter dimension, t the number of tasks, and b the hypernetwork bottleneck size. For simplicity, we assume below that the encoder and decoder adapter sizes are the same, but it is straightforward to calculate the parameters used in the encoder and decoder separately and sum these for cases where the sizes differ. We leave out bias parameters for simplicity.

D.1 Adapters
We only consider placing adapters in the feedforward module of the transformer layer. Every layer has one adapter consisting of two linear layers and a non-linearity, giving l(2ad + a + d) new parameters overall.
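For example, this count can be computed directly (the layer, hidden, and adapter sizes below are illustrative, loosely following the size-410 adapters used in Appendix A.1):

```python
def adapter_params(l, d, a):
    # each layer adds one adapter: down-projection (d*a), up-projection (a*d),
    # plus the two bias vectors (a and d)
    return l * (2 * a * d + a + d)

# illustrative: 24 layers (encoder + decoder), hidden size 768, adapter size 410
n = adapter_params(l=24, d=768, a=410)
assert n == 15_142_512
```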

D.2 Task Hypernetwork
For the task-embedding-based hypernetwork approach, we generate all adapters with two hypernetworks, one for the encoder and one for the decoder. The main parameter cost comes in the form of the final layer of the hypernetwork, which consists of four linear layers producing the weights and biases of the two linear layers making up the adapter, thus costing b(2ad + a + d). Given task embedding size e_t and layer embedding size e_l, the total cost of the hypernetwork is te_t + le_l + (e_t + e_l)b + b(2ad + a + d). This is then multiplied by two, as we have hypernetworks for the encoder and decoder.
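As a worked instance of this formula (the embedding sizes e_t = e_l = 64 and task count here are hypothetical, chosen only for illustration):

```python
def task_hypernet_params(l, d, a, t, b, e_t, e_l):
    # per side (encoder or decoder): task embeddings, layer embeddings,
    # input projection into the bottleneck, and the final output heads
    one_side = t * e_t + l * e_l + (e_t + e_l) * b + b * (2 * a * d + a + d)
    return 2 * one_side  # separate hypernetworks for encoder and decoder

# illustrative: 12 layers per side, hidden size 768, adapter size 50,
# 8 tasks, bottleneck 100, embedding sizes 64
n = task_hypernet_params(l=12, d=768, a=50, t=8, b=100, e_t=64, e_l=64)
assert n == 15_551_760
```

The b(2ad + a + d) output-head term dominates, which is why the bottleneck size b controls most of the trainable-parameter budget.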

D.3 Encoder-conditioned Decoders
The cost of the decoder-side hypernetwork is similar to the task hypernetwork case, except we add the cost of the MLP and use the hidden size of the model d in place of the task embedding size, giving le_l + (d + e_l)b + b(2ad + a + d) plus the MLP parameters. The encoder costs the same as the adapters case: l(2ad + a + d).
Our overall cost is simply the sum of the two.

E Dataset Statistics
Tables 10, 11, and 12 provide summary statistics and split sizes for each dataset used.

F Full GLUE Ablation Results
In Table 13 we provide the full results from the ablations performed in Tables 8 and 9. Model names are in the format <encoder type>-<decoder type>. 'Generated' refers to encoder-conditioned adapters, 'manual' to regular adapters, and 'task' to task-conditioned adapters.

G MRQA Varied Encoder Results
In Tables 14 and 15 we provide results from experiments varying encoder and decoder sizes, as mentioned in section 4.2. For the final results reported in Tables 4 and 5, we use an encoder size of 512 for adapters and the hyperdecoder.

H Full Summarisation and NLI Results
In Table 16 we provide a breakdown of the results reported in Table 6.

Figure 1: An overview of our proposed approach, where a hypernetwork generates the adapters for the decoder in an encoder-decoder model. Given an input instance and task name, an encoder produces an embedding which is used to generate decoder adapter parameters using a hypernetwork.
Figure 2: t-SNE visualisation of GLUE validation set hypernetwork embeddings produced by our approach.

Figure 3: PCA visualisation of GLUE validation set hypernetwork embeddings, coloured by predicted label. STS-B examples removed for simplicity, as it is cast to a 21-class classification task for T5.
Figure 4:

Table 2:
Performance of models using T5 large v1.1 + LM as the base model on GLUE dev set splits, picking the best task performance across all checkpoints. Trainable parameters is the % of trainable parameters used compared to fully finetuning T5. * Results from He et al. (2022b).

Table 3:
Performance of models using T5 base vanilla as the base model on GLUE test splits described in section 3.1.1. * Results reported by Karimi Mahabadi et al. (2021).

Table 4:
F1 score of models on in-domain MRQA validation split. * Results from Pilault et al. (2021), who do not report performance on individual datasets and train on additional data. All non-BERT models use T5 base v1.1 + LM.

Table 5:
F1 score of models on out-of-domain MRQA validation split. All non-BERT models use T5 base v1.1 + LM. * Results from Friedman et al. (2021). ** Task Hypernet uses an average of the learnt in-domain task embeddings to condition its adapters. † RACE was part of UnifiedQA's training data.

Table 6:
Average performance across Summarisation and NLI datasets using T5 base v1.1 + LM. Summarisation performance is the average of R2 scores, while NLI is the average of accuracy scores. The overall average is the arithmetic mean of the two.

Table 8:
Average GLUE benchmark performance across varied encoder/decoder adapter configurations using T5 large v1.1 + LM as the base model. The number of trainable parameters is kept similar across approaches apart from full finetuning. More details are given in Appendix F.

Table 9:
GLUE performance over different ablations.

Table 10:
Summary statistics of splits used when evaluating GLUE.

Table 11:
Summary statistics of splits used when evaluating MRQA, before applying preprocessing.

Table 13:
Performance of models using T5 large v1.1 + LM as the base model on GLUE test set splits described in section 3.1.1. Trainable parameters is the % of trainable parameters used compared to fully finetuning T5.

Table 14:
F1 score of models on in-domain MRQA validation split. All models use the T5 base v1.1 + LM checkpoint as a starting point.

Table 15:
F1 score of models on out-of-domain MRQA validation split. All models use the T5 base v1.1 + LM checkpoint as a starting point.

Table 16:
Performance across Summarisation and NLI datasets using T5 base v1.1 + LM. Summarisation scores are Rouge2 scores, while NLI scores are accuracy on the test splits described in section 3.1.3. 'Enc-heavy' variants use encoder adapter sizes of 512 and the matching decoder/hypernetwork sizes in Table 14.