Learning to Generate Task-Specific Adapters from Task Description

Pre-trained text-to-text transformers such as BART have achieved impressive performance across a range of NLP tasks. Recent studies further show that they can learn to generalize to novel tasks by including task descriptions as part of the source sequence and training the model with (source, target) examples. At test time, these fine-tuned models can make inferences on new tasks using the new task descriptions as part of the input. However, this approach has potential limitations, as the model learns to solve individual (source, target) examples (i.e., at the instance level), instead of learning to solve tasks by taking all examples within a task as a whole (i.e., at the task level). To this end, we introduce Hypter, a framework that improves a text-to-text transformer's generalization ability to unseen tasks by training a hypernetwork to generate task-specific, light-weight adapters from task descriptions. Experiments on the ZEST dataset and a synthetic SQuAD dataset demonstrate that Hypter improves upon fine-tuning baselines. Notably, when using BART-Large as the main network, Hypter brings an 11.3% relative improvement on the ZEST dataset.


Introduction
Pre-trained text-to-text models (Raffel et al., 2020; Lewis et al., 2020) provide a unified formulation and off-the-shelf weights for a variety of NLP tasks, such as question answering (Khashabi et al., 2020) and commonsense reasoning (Bosselut et al., 2019). In addition to their strong performance, text-to-text models naturally support generalizing to novel tasks, by incorporating the task description as part of the source sequence and fine-tuning the model with (source, target) examples (Weller et al., 2020). At inference time, the model is required to perform unseen tasks with the source sequence containing new task descriptions. Code and data can be found at https://github.com/INK-USC/hypter.

Figure 1: Instead of learning from (source, target) examples, in this paper we study the problem of learning from task descriptions (Weller et al., 2020). The train set contains M tasks, and the i-th task contains N_i examples of (s, t) pairs in text format. At test time, the learned model is required to directly make inferences on a new task given a task description. (a) Zero-shot Learning from Task Description, ZEST dataset (Weller et al., 2020); (b) Synthetic Version of SQuAD (Rajpurkar et al., 2016).

While this initial attempt shows positive results, there are two potential limitations to the direct fine-tuning approach. (1) Predictions can be sensitive to the task descriptions (or "prompts"), which are heuristically designed (Jiang et al., 2020); paraphrasing the task description may degrade performance.
(2) The model still learns from individual (source, target) examples, instead of learning to solve tasks at a higher level by explicitly taking multiple examples within a task as a whole (see Fig. 1). Meanwhile, applying existing zero-shot learning methods that support task-level learning to text-to-text transformers is non-trivial. Methods designed specifically for classification problems, such as prototypical networks (Snell et al., 2017), cannot be directly applied to text-to-text models. Moreover, given the large size of text-to-text models, generating parameters for a whole model from the task description (Jin et al., 2020) is infeasible.
In this work, we follow the settings in Weller et al. (2020) and aim to improve a model's generalization ability to unseen tasks by better incorporating task descriptions and using a task-level training procedure.

Figure 2: Left: A hypernetwork generates parameters φ_i for the task-specific adapter i that is plugged into transformer layer i of the text-to-text model. Right: The adapted main network is evaluated on a task (d, D). The final cross-entropy loss is back-propagated to update the hypernetwork.

We introduce HYPTER, a framework that employs a hypernetwork (Ha et al., 2017) to dynamically generate task-specific parameters (i.e., adapters) from task descriptions. Adapters (Houlsby et al., 2019) are light-weight modules that can be inserted into transformer layers for parameter-efficient adaptation. This formulation effectively enables learning at the task level: the model learns to generate appropriate parameters for a task, and its competence on each task is examined using multiple examples within that task. This is in contrast to learning at the instance level, i.e., learning to generate the correct output for one specific input sequence.
We apply HYPTER to two datasets: ZEST (Weller et al., 2020) and a synthetic version of SQuAD (Rajpurkar et al., 2016). We demonstrate that HYPTER improves upon direct fine-tuning baselines. Notably, training with HYPTER achieves a 0.45% absolute improvement (11.3% relative improvement) in the Competence@90 metric on ZEST when BART-Large is used as the main network.

Problem Definition
We study the problem of learning from task descriptions (Weller et al., 2020), and aim to improve models' competence on unseen tasks at inference time. Formally, a task is denoted as a tuple (d, D), where d is the natural language description of the task, and D = {(s_1, t_1), ..., (s_n, t_n)} contains the (source, target) examples of this task (see Fig. 1). In our text-to-text formulation, both s_i and t_i are text sequences. At train time, both d and D are available; at test time, an unseen description d is given, and the model is expected to predict the correct t given input s, without further training.
For instance, in the ZEST dataset (Weller et al., 2020), a task description in the train set can be "Are mountain bikes allowed at this national park?", while D contains twenty paragraphs for different national parks and the twenty corresponding answers. At test time, a novel task may be "Are there fish in this national park that live in caves?", and the model is asked to directly make inferences.
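A minimal sketch of this data format, assuming the class and field names below (they are illustrative, not from the Hypter codebase):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Task:
    description: str                 # task description d
    examples: List[Tuple[str, str]]  # (source, target) pairs, i.e., D

train_task = Task(
    description="Are mountain bikes allowed at this national park?",
    examples=[
        ("... paragraph about one national park ...", "Yes"),
        ("... paragraph about another national park ...", "n/a"),
    ],
)
```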

Background: Adapters
Our work is built on adapters (Houlsby et al., 2019), light-weight modules that can be placed into transformer layers for parameter-efficient transfer learning. In the original paper, the main model is frozen during training, while only the layer norm and adapter parameters are learnable. In this paper, we adopt a simplified design compared to the original (see Fig. 2, Left): in each transformer layer, exactly one adapter module is added after the multi-headed attention. One adapter module contains two linear layers separated by a nonlinear activation. We use (W_id, b_id) to denote the down-projection parameters for the adapter in transformer layer i, and (W_iu, b_iu) for the up-projection parameters.
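A minimal PyTorch sketch of this simplified adapter design; the residual connection and choice of ReLU are common for adapters but are assumptions here:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim: int, adapter_width: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, adapter_width)  # (W_id, b_id)
        self.up = nn.Linear(adapter_width, hidden_dim)    # (W_iu, b_iu)
        self.activation = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Bottleneck transformation with a residual connection, so the
        # module stays near-identity when its weights are small.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))
```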

Method
Overview. Fig. 2 provides an illustration of our HYPTER framework. HYPTER has two major parts: (1) a main network, which is a pre-trained text-to-text model; we instantiate the main network with BART-Base/Large (Lewis et al., 2020). (2) A hypernetwork, which generates adapters to be plugged into the main network. Fig. 2 (Left) illustrates in detail how adapter parameters are generated and how adapter layers are incorporated into one transformer layer.
Hypernetwork. The hypernetwork consists of an encoder and multiple decoders. The encoder maps the task description d to a latent representation h_0, while the decoders use h_0 to generate adapter parameters φ. In our work we instantiate the encoder with a RoBERTa-Base model (Liu et al., 2019), i.e., h_0 = RoBERTa(d). For a text-to-text model with n transformer layers, the hypernetwork contains n decoders. Decoder i takes h_0 as input and outputs the adapter parameters φ_i for transformer layer i; the decoder weights are trainable parameters. The generated parameters φ_i are sliced and reshaped to become the parameters (W_id, b_id, W_iu, b_iu) of adapter i.

Model Training. We adopt a training schedule where we first train the main network, then train the hypernetwork while the main network is frozen. Conceptually, the first stage ensures that the main network captures the general ability shared across different tasks; the second stage allows the hypernetwork to learn to adapt the main network to a specific task. During the first stage, the text-to-text model is fine-tuned with all (Concat(d, s), t) examples in the training set, where Concat(d, s) denotes the concatenation of the task description d and the input s. The learned main network from this stage also serves as the baseline method.
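Before turning to the second stage, here is a minimal sketch of the hypernetwork component described above. The two-layer tanh decoder and the decoder width are assumptions; the text only states that the decoder weights are trainable:

```python
import torch.nn as nn
from transformers import RobertaModel

class HyperNetwork(nn.Module):
    def __init__(self, num_layers, hidden_dim, adapter_width, decoder_dim=128):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        # Flattened size of one adapter's parameters: W_id, b_id, W_iu, b_iu.
        phi_size = 2 * hidden_dim * adapter_width + adapter_width + hidden_dim
        self.decoders = nn.ModuleList([
            nn.Sequential(
                nn.Linear(self.encoder.config.hidden_size, decoder_dim),
                nn.Tanh(),
                nn.Linear(decoder_dim, phi_size),
            )
            for _ in range(num_layers)
        ])

    def forward(self, input_ids, attention_mask):
        # h_0 = RoBERTa(d): representation of the first token.
        h0 = self.encoder(input_ids=input_ids,
                          attention_mask=attention_mask).last_hidden_state[:, 0]
        # One flat parameter vector per layer, later sliced and reshaped
        # into (W_id, b_id, W_iu, b_iu) of adapter i.
        return [decoder(h0) for decoder in self.decoders]
```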
During the second stage, we sample a task (d, D) from the training set and sample a mini-batch of (s, t) examples from D. Given the description d, the hypernetwork generates adapter parameters φ_i. We insert the resulting adapter layers into the main network and compute the cross-entropy loss L of generating t given the input Concat(d, s). The loss is end-to-end differentiable and is back-propagated to update the hypernetwork, while the main network is frozen (see Fig. 2, Right). This second stage effectively enables learning at the task level: the loss L characterizes the model's competence on the task (d, D), so by optimizing L, the model is trained to solve tasks.
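A minimal sketch of this second stage, continuing the classes above. `insert_adapters` (which plugs the generated, non-leaf parameter tensors into the frozen main network) and `sample_batches` are hypothetical helpers, and task-level batching (updating every b tasks) is omitted for brevity:

```python
import torch

def train_stage_two(hypernet, main_model, tokenizer, tasks, epochs=1, b_prime=8):
    for p in main_model.parameters():   # main network stays frozen;
        p.requires_grad = False         # only the hypernetwork is updated
    optimizer = torch.optim.Adam(hypernet.parameters(), lr=1e-5)

    for _ in range(epochs):
        for description, examples in tasks:       # a task is (d, D)
            desc = tokenizer(description, return_tensors="pt")
            phi = hypernet(desc.input_ids, desc.attention_mask)
            insert_adapters(main_model, phi)      # hypothetical helper

            for sources, targets in sample_batches(examples, b_prime):
                inputs = tokenizer([description + " " + s for s in sources],
                                   return_tensors="pt", padding=True)
                labels = tokenizer(targets, return_tensors="pt",
                                   padding=True).input_ids
                # Cross-entropy loss of generating t from Concat(d, s);
                # gradients flow through phi back into the hypernetwork.
                loss = main_model(**inputs, labels=labels).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
```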
Model Inference. At test time, the model is given an unseen task description d. The hypernetwork generates description-dependent adapter parameters, following the same procedure as during training. In this way, we obtain a model that is capable of making inferences for the new task.
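For completeness, a sketch of inference under the same assumptions, reusing the hypothetical `insert_adapters` helper from the training sketch:

```python
desc = tokenizer(new_description, return_tensors="pt")
insert_adapters(main_model, hypernet(desc.input_ids, desc.attention_mask))
inputs = tokenizer(new_description + " " + source, return_tensors="pt")
output_ids = main_model.generate(**inputs)
prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```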

Experiment Setup
Datasets. We use two datasets that fit our setup. The first is the Zero-shot Learning from Task Descriptions dataset (ZEST; Weller et al. 2020), which formulates task descriptions as generalized questions and provides multiple source-target examples for each question. Performance is evaluated with a novel metric, "Competence@K", along with mean F1 score. Competence@K is the percentage of all tasks for which the model achieves a mean F1 score higher than K. For example, Competence@90=5 means that 5% of all tasks are solved with mean F1 above 90. We report dev set performance, and hidden test set performance obtained from ZEST's official leaderboard.
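A minimal sketch of the Competence@K metric as defined above:

```python
def competence_at_k(mean_f1_per_task, k):
    """`mean_f1_per_task`: one mean F1 score (0-100) per task."""
    solved = sum(f1 > k for f1 in mean_f1_per_task)
    return 100.0 * solved / len(mean_f1_per_task)

# e.g., competence_at_k(task_f1s, 90) == 5.0 reads as: 5% of tasks
# are solved with mean F1 above 90.
```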
We construct the second dataset from SQuAD v1 (Rajpurkar et al., 2016) to simulate the problem setting in this paper. We refer to this dataset as Synthetic SQuAD. Specifically, we construct tasks from the original SQuAD train set according to "question type", the bi-gram containing the central question word (e.g., what, when). For example, "when does" questions are considered one task, and "what country" questions are considered another task. These bi-grams are used as "task descriptions". We select the 100 most frequent question types in the SQuAD train set and randomly subsample 64 examples from each type to form our dataset. We then randomly split the 100 types into 80/10/10 for train/dev/test. In addition, we select examples that fall into the 10 test question types from Natural Questions (Kwiatkowski et al., 2019) and NewsQA (Trischler et al., 2017), and use these as out-of-domain test examples. Performance is evaluated with mean F1. We include the list of question types and more details about this dataset in Appendix A.
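A minimal sketch of this construction; the question-type heuristic and helper names are illustrative, not the exact procedure used for the released data:

```python
import random
from collections import defaultdict

QUESTION_WORDS = {"what", "when", "where", "who", "why", "how", "which", "whose"}

def question_type(question):
    # Bi-gram starting at the first question word, e.g., "when does".
    tokens = question.lower().split()
    for i, tok in enumerate(tokens[:-1]):
        if tok in QUESTION_WORDS:
            return tok + " " + tokens[i + 1]
    return " ".join(tokens[:2])

def build_tasks(squad_examples, num_types=100, per_type=64, seed=0):
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for ex in squad_examples:
        by_type[question_type(ex["question"])].append(ex)
    # Keep the most frequent question types, subsample examples per type.
    top = sorted(by_type, key=lambda t: len(by_type[t]), reverse=True)[:num_types]
    tasks = {t: rng.sample(by_type[t], per_type) for t in top}
    rng.shuffle(top)  # random 80/10/10 split over question types
    return tasks, top[:80], top[80:90], top[90:]
```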
Baseline. To demonstrate the efficacy of the HYPTER framework, we compare it to just its first half: the main text-to-text transformer model obtained after the first stage of training. This is identical to the fine-tuning baseline in Weller et al. (2020), and there are no other applicable baselines to the best of our knowledge.

Training Details. For each method, we train the model 7 times using different random seeds, and we report the average and standard deviation. We discuss other training details, including hyperparameters, in Appendix B. Notably, we ensure all baseline models will not benefit from additional training, by tuning the number of epochs and using early stopping based on dev performance. This ensures the improvement brought by HYPTER is not due to additional training.

Results
Main Results. We present the results for ZEST in Tables 1-2 and the results for Synthetic SQuAD in Table 3. On the ZEST test set, the Competence@90 metric improves from 3.98 to 4.43 when using BART-Large, an 11.3% relative improvement; when BART-Base is used, C@90 improves from 2.23 to 2.53. This demonstrates that by learning to solve tasks with HYPTER, the model's generalization ability to unseen tasks is improved. On the Synthetic SQuAD dataset, we observe a 0.74% improvement with BART-Base and a 0.41% improvement with BART-Large. Additionally, models trained with HYPTER achieve comparable or better performance on the out-of-domain test sets, suggesting that the learned task-solving ability generalizes to new test distributions.

It is a known issue that evaluating zero-shot performance can be tricky. We tried our best to reduce the randomness and instability by using different random seeds. In Table 1 and Table 3, we demonstrate that the performance improvement is significant (p<0.05) in multiple settings, e.g., on the ZEST dev set with the C@75 metric.
Model Behavior Analysis on ZEST. The ZEST dataset provides a comprehensive analysis protocol by splitting tasks into different generalization types (base, paraphrase, composition, semantic flips, and output structure) and defining four error types (recall, precision, partial, and other). Compared to the BART-Large fine-tuning baseline, our model achieves better performance in the "base" and "paraphrase" categories on the ZEST official test set. We also manually inspected dev set predictions produced by the baseline and our model, and found that the predictions corrected by our method span all four error types. In particular, the proposed method flipped two "n/a" predictions into the correct answers in the task "Which royalty was this dog breed popular with?" ("base" category), reducing recall errors and improving the competence metric. Beyond this, we do not observe more granular patterns in model behavior.
Study of Data Efficiency. We study whether HYPTER remains effective when trained with (1) fewer tasks, with the number of examples per task unchanged, or (2) fewer examples per task, with the total number of tasks kept constant. We experiment with ZEST and BART-Large, and show the results in Fig. 3. We observe that HYPTER is effective when trained with 75%/100% of the tasks, but does not improve performance with fewer tasks. This is reasonable, since HYPTER learns at the task level (taking one task as an "example"), and 50% of the tasks may be insufficient. We also observe performance improvements with 75%/100% of the examples per task, but not with fewer, suggesting that a sufficient number of examples per task is necessary for HYPTER to generate effective adapters.

Related Work
Zero-shot Learning with Transformers. Zero-shot learning (ZSL) has been explored for various NLP tasks, including text classification (Yin et al., 2019), entity linking (Logeswaran et al., 2019), and entity typing (Obeidat et al., 2019). Several works study cross-task transfer by unifying the input-output format, e.g., casting relation extraction as machine reading comprehension (Levy et al., 2017), or named entity recognition as machine reading comprehension. Such formulations allow generalization to unseen relation or named entity types at test time. Learning from task descriptions (Weller et al., 2020) and instructions (Mishra et al., 2021) can be considered a sub-category of zero-shot learning, with the goal of generalizing to unseen tasks during inference.
Adapters for Transformers. Houlsby et al. (2019) proposed adapter layers for parameter-efficient transfer learning in NLP. Adapter layers, which adopt a bottleneck architecture with two linear layers, are added after each multi-headed attention layer and each feed-forward layer in a pre-trained transformer. Adapters have recently been applied in multi-lingual settings, with successes in NER, QA, and commonsense reasoning (Pfeiffer et al., 2020; Philip et al., 2020; Artetxe et al., 2020).

Hypernetworks and Contextual Parameter Generators. A hypernetwork (Ha et al., 2017) is, broadly, a network that generates the weights of another network. This concept has been applied to visual reasoning (Perez et al., 2018), zero-shot image classification (Jin et al., 2020), etc. Closely related to our work, UDapter (Üstün et al., 2020) studies multilingual dependency parsing by generating adapter parameters. Our work is more general, as we restrict neither the task format (dependency parsing vs. general text-to-text tasks) nor the relation between sub-tasks (cross-lingual tasks vs. tasks with text-form descriptions).

Conclusion
In this paper, we introduced HYPTER, a framework that improves a text-to-text transformer's generalization ability to unseen tasks. HYPTER enhances task-specific abilities by inserting adapters generated by a hypernetwork, while maintaining the model's general task-solving ability by freezing the main model's parameters. We demonstrated the effectiveness of HYPTER on two datasets. Future work may explore teaching models with compositional instructions using HYPTER, or propose robust fine-tuning methods that help the model generalize to unseen data. It is also necessary to construct a large dataset of diverse NLP tasks to facilitate future research in this direction.

B Training Details

For hypernetwork training, we train for up to 100 epochs (one epoch here refers to an iteration over all tasks). We update the hypernetwork every b tasks, and call b the task batch size. When learning from one task, we sample b′ examples within the task, and call b′ the example batch size. We greedily and sequentially select the adapter width d from {4, 8, 16, 32}, the learning rate α from {3e-6, 1e-5, 3e-5, 1e-4}, b from {4, 8, 16, 32}, and b′ from {4, 8, 16, 32}, based on dev set performance.
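A minimal sketch of this greedy, sequential selection: each hyperparameter is tuned in turn while earlier choices stay fixed. `evaluate_on_dev` and the starting values are hypothetical placeholders:

```python
def greedy_select(grids, start, evaluate_on_dev):
    best = dict(start)
    for name, values in grids.items():  # tune d, then alpha, then b, then b'
        scores = {v: evaluate_on_dev({**best, name: v}) for v in values}
        best[name] = max(scores, key=scores.get)
    return best

grids = {
    "adapter_width":      [4, 8, 16, 32],            # d
    "learning_rate":      [3e-6, 1e-5, 3e-5, 1e-4],  # alpha
    "task_batch_size":    [4, 8, 16, 32],            # b
    "example_batch_size": [4, 8, 16, 32],            # b'
}
start = {name: values[0] for name, values in grids.items()}  # assumed defaults
```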

C Additional Baseline
Another reasonable baseline is to fine-tune a text-to-text model together with randomly initialized adapters plugged into it. We experiment with this method using BART-Large and list the performance in Table 4. We do not observe significant differences between the two methods (p=0.8840 for C@75, p=0.8118 for C@90, two-tailed paired t-test).

D Dev Set Performance of Models Submitted to ZEST Leaderboard
In Table 5 we present the dev performance of the models submitted to the leaderboard. The submitted models are the "first runs" of the 7-run series, as we added the 7-run experiments and significance tests later, following a reviewer's suggestion.

E Discussion
It is worth noting that the efficacy of HYPTER comes at the cost of introducing new parameters, which are trained in the hypernetwork. One may also achieve better generalization to unseen tasks with larger pre-trained models containing billions of parameters; in this case, we consider HYPTER an alternative that augments a medium-sized pre-trained model with a hypernetwork. Meanwhile, we highlight that our contribution is the concept of generating task-specific adapters from descriptions, together with HYPTER's task-level training procedure.