Differentiable Instruction Optimization for Cross-Task Generalization

Instruction tuning has attracted much attention as a way to achieve generalization across a wide variety of tasks. Although various types of instructions have been manually created for instruction tuning, it is still unclear what kind of instruction is optimal for obtaining cross-task generalization ability. This work presents instruction optimization, which optimizes training instructions with respect to generalization ability. Rather than manually tuning instructions, we introduce learnable instructions and optimize them with gradient descent by leveraging bilevel optimization. Experimental results show that the learned instruction enhances the diversity of instructions and improves the generalization ability compared to using only manually created instructions.


Introduction
Recently, significant progress has been made in developing models that can generalize to arbitrary tasks by following natural language descriptions (Brown et al., 2020; Ouyang et al., 2022). Instruction tuning has attracted interest as a training technique for obtaining such generalization ability (Wei et al., 2022; Sanh et al., 2022; Mishra et al., 2022). By finetuning pretrained language models on a variety of tasks with their instructions, models can generalize to arbitrary tasks unseen during training. Many previous studies have witnessed the effectiveness of instruction tuning (Chung et al., 2022; Wang et al., 2022; Lampinen et al., 2022).
Various instructions have been created for instruction tuning, such as the task name, task definition, positive/negative exemplars of a task, and explanations of why each positive/negative exemplar is correct/incorrect. However, Mishra et al. (2022); Wang et al. (2022) showed that the definition and positive exemplars of tasks are sufficient for instruction tuning, and the effect of adding other types of instruction is negligible or sometimes has a negative impact on the generalization performance.

[Figure 1: a learnable instruction ϕ is prepended to task inputs (e.g., "This film is fun."); the model θ is trained on meta-train tasks, and the instruction is optimized on meta-test tasks.]
Seeking an optimal instruction for cross-task generalization is an important issue for instruction tuning, but it requires much human effort (100+ researchers participated in previous studies). Furthermore, human-interpretable instructions are not necessarily optimal for obtaining cross-task generalization ability.
Against this background, we propose instruction optimization, which introduces learnable instructions and optimizes them w.r.t. the cross-task generalization ability. As shown in Figure 1, a model θ is optimized to maximize the performance on meta-train tasks following learnable instructions. By contrast, the learnable instructions ϕ are trained to maximize the meta-test performance of the trained model θ*(ϕ). This optimization is called bilevel optimization and is frequently used in hyperparameter optimization (Franceschi et al., 2017; Lorraine et al., 2020), meta-learning (Finn et al., 2017; Franceschi et al., 2018), and neural architecture search (Liu et al., 2018; Zhang et al., 2021). We regard training instructions as a special type of hyperparameter and optimize them with gradient descent by relaxing the search space to be continuous.
To create learnable instructions, we propose two methods: the instruction embedder, which generates the embeddings of instructions, and the instruction extractor, which selects an optimal task exemplar. Recently, prompt engineering has drawn attention, seeking the optimal prompt to achieve a task (Liu et al., 2022b). Some work studies continuous prompts that perform prompting in the embedding space of tokens (Li and Liang, 2021; Lester et al., 2021), whereas other work retrieves optimal exemplars as a testing prompt for in-context learning (Liu et al., 2022a; Rubin et al., 2022). Our instruction embedder and instruction extractor follow the ideas of continuous prompts and prompt retrievers, respectively. Whereas previous work optimizes prompts to solve an individual task at test time, our study differs in the target and aim of optimization: we optimize the training prompts to maximize the cross-task generalization ability of the trained model.
In the experiments, we confirmed that the instruction extractor successfully extracts appropriate instructions, providing a proof of concept. Regarding the comparison with instruction tuning, the instruction embedder enhances the diversity of instructions and improves the generalization ability compared to using only manually created instructions. In contrast, the instruction extractor does not contribute to a performance gain, which shows that using the same task exemplar across instances is unexpectedly preferable for cross-task generalization. This study provides a basis for exploring the optimal instructions for instruction tuning.

Preliminaries
Instruction tuning trains a model θ to minimize the training loss defined in Eq. (1):

$$\mathcal{L}(\theta) = \sum_{t \in T_{\mathrm{train}}} \sum_{i=1}^{N_t} -\log p_\theta\big(y_t^{(i)} \mid [I_t; X_t^{(i)}]\big) \quad (1)$$

where $X_t^{(i)}$ and $I_t$ denote the embedding matrix of the i-th input and the instruction of task t, respectively, and $y_t^{(i)}$ is a sequence of tokens that represents a class label or reference text. Instruction tuning regards all tasks as conditional text generation given the concatenation of the instruction and the task input $[I_t; X_t^{(i)}]$. By prepending the instruction to the task input, the trained model θ* can generalize to a variety of unseen tasks t ∉ T_train. The optimal training instructions have been sought by manually creating various types of instruction for instruction tuning (Mishra et al., 2022; Wei et al., 2022; Sanh et al., 2022). However, Mishra et al. (2022); Wang et al. (2022) showed that the task definition and task exemplars are sufficient for instruction tuning, while adding other types of instruction has a negligible or sometimes negative effect on the generalization performance. This observation motivates us to automatically optimize training instructions rather than manually tuning them. We introduce learnable instructions and optimize them with gradient descent by leveraging bilevel optimization. The next section provides the details of instruction optimization.

Instruction Optimization
Instruction optimization splits the training tasks T_train into two sets: meta-train tasks T_meta-train and meta-test tasks T_meta-test. Subsequently, a model θ is trained to minimize the inner loss on the meta-train tasks following the learnable instructions I_ϕ in Eq. (2):

$$\theta^*(\phi) = \arg\min_\theta \mathcal{L}_{\mathrm{in}}(\theta, \phi) = \arg\min_\theta \sum_{t \in T_{\mathrm{meta\text{-}train}}} \sum_{i=1}^{N_t} -\log p_\theta\big(y_t^{(i)} \mid [I_\phi; X_t^{(i)}]\big) \quad (2)$$
where ϕ is a parameter for the learnable instructions. I_ϕ is constructed using the instruction embedder (Section 3.1) or the instruction extractor (Section 3.2), which are explained below.
If the learnable instruction I_ϕ is randomly created, the trained model θ*(ϕ) performs poorly on unseen tasks. Therefore, we optimize ϕ such that the trained model θ*(ϕ) achieves high performance on the meta-test tasks, which are not shown during training. ϕ is updated to minimize the outer loss in Eq. (3):

$$\min_\phi \mathcal{L}_{\mathrm{out}}(\theta^*(\phi)) = \min_\phi \sum_{t \in T_{\mathrm{meta\text{-}test}}} \sum_{i=1}^{N_t} -\log p_{\theta^*(\phi)}\big(y_t^{(i)} \mid [I_t; X_t^{(i)}]\big) \quad (3)$$
This optimization is called bilevel optimization and is commonly used in hyperparameter optimization.
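As a toy illustration (scalar quadratics, not the paper's losses), the bilevel scheme can be sketched as follows: the inner loop approximates θ*(ϕ) by K gradient steps, and the outer loop updates ϕ through the resulting outer loss. For brevity the hypergradient here is a finite-difference stand-in for the implicit differentiation described later.

```python
# Toy bilevel problem (illustrative quadratics, NOT the paper's losses):
#   inner:  L_in(theta, phi) = (theta - phi)^2   ->  theta*(phi) = phi
#   outer:  L_out(theta*)    = (theta* - 3)^2    ->  optimal phi = 3
def inner_solve(phi, K=50, lr=0.1):
    """Approximate theta*(phi) with K gradient steps (truncated inner loop)."""
    theta = 0.0
    for _ in range(K):
        theta -= lr * 2.0 * (theta - phi)   # d L_in / d theta
    return theta

def outer_loss(phi):
    return (inner_solve(phi) - 3.0) ** 2

phi, eps, lr_out = 0.0, 1e-4, 0.1
for _ in range(100):
    # hypergradient via central finite differences
    # (a stand-in for the implicit differentiation of Section 3.3)
    g = (outer_loss(phi + eps) - outer_loss(phi - eps)) / (2 * eps)
    phi -= lr_out * g

print(round(phi, 2))  # converges toward the outer optimum 3.0
```

The inner problem pulls θ toward ϕ, so the outer update effectively steers ϕ toward the value that makes the trained θ* perform best, mirroring how the learnable instruction is steered by meta-test performance.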
Figure 2: Outline of the instruction embedder and instruction extractor. Instruction tuning uses a manually created instruction or a randomly selected exemplar as the training instruction. In contrast, the instruction embedder introduces learnable embeddings of the instruction, while the instruction extractor selects an optimal exemplar as the training instruction.
Note that we use the manually created instruction I t to measure the meta-test performance because we aim to develop a model that can accept arbitrary human-created instructions.

Instruction Embedder
This section presents methods for creating the learnable instructions I_ϕ. As shown in Figure 2 (left), the instruction embedder either replaces manually created instructions with learnable instruction embeddings or prepends them to manually created instructions. We consider the following two parameterizations of learnable instructions.

Direct Parameterization (DP) We parameterize the learnable instruction I_ϕ by preparing a learnable matrix for each task: I_ϕ = W_t ∈ R^{l×d}, where l denotes the (arbitrary) length of a learnable instruction and d is the dimension of the embeddings in the model θ. Although this parameterization is very simple, the size of the parameter ϕ (|T_train| × l × d) grows when many training tasks exist. Moreover, as each learnable matrix W_t is updated only when task t is used for computing the meta-train loss, the matrices are updated infrequently when the number of training tasks is large. Therefore, we propose another parameterization method that is scalable to a large number of training tasks.
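A minimal sketch of the direct parameterization, with illustrative task names and dimensions (l, d, and the task list are hypothetical, not the paper's settings):

```python
import numpy as np

# Direct parameterization (sketch): one learnable matrix W_t per training task.
rng = np.random.default_rng(0)
l, d = 4, 8                       # instruction length, model embedding dim
tasks = ["sentiment", "translation", "paraphrase"]

# phi holds |T_train| x l x d parameters in total
phi = {t: rng.normal(size=(l, d)) for t in tasks}

def learnable_instruction(task, x_embed):
    """Prepend the task's learnable instruction to the input embeddings."""
    return np.concatenate([phi[task], x_embed], axis=0)

x = rng.normal(size=(5, d))                 # a 5-token task input
out = learnable_instruction("sentiment", x)
print(out.shape)                            # (l + 5, d) -> (9, 8)
```

The dictionary makes the scaling issue concrete: every new training task adds another l × d matrix, and each matrix receives gradients only when its task appears in a meta-train batch.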
Instance Conversion (IC) Another parameterization method is to convert a task instance z_t^{(i)} into I_ϕ, as shown in Eq. (4) and (5):

$$h_t^{(i)} = \mathrm{avgpool}\big(V_\phi[z_t^{(i)}]\big) \quad (4)$$
$$I_\phi = W_\phi h_t^{(i)} \quad (5)$$

where the task instance z_t^{(i)} is a sequence of tokens defined as "Input: X_t^{(i)} Output: y_t^{(i)}", in which X_t^{(i)} and y_t^{(i)} represent the i-th input and output of a task t, respectively. V_ϕ ∈ R^{v×d′} is a word embedding matrix, where v denotes the vocabulary size, and avgpool denotes the average-pooling operation across the embedded tokens, yielding the latent representation h_t^{(i)}. W_ϕ ∈ R^{l×d×d′} is a learnable tensor that converts the latent representation into an instruction¹. We assume that V_ϕ and W_ϕ are optimized to generate an optimal instruction given a task instance. As the parameters are shared across all training tasks, this parameterization is scalable to a large number of training tasks.
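Eq. (4) and (5) can be sketched as below; all dimensions are illustrative, and `instruction_from_instance` is a hypothetical helper name, not the authors' code:

```python
import numpy as np

# Instance conversion (sketch): shared parameters V_phi and W_phi map any task
# instance to an instruction embedding. All dimensions are illustrative.
rng = np.random.default_rng(0)
v, d_latent, l, d = 100, 16, 4, 8   # vocab, latent dim, instr. length, model dim

V_phi = rng.normal(size=(v, d_latent)) * 0.1     # word embedding matrix
W_phi = rng.normal(size=(l, d, d_latent)) * 0.1  # latent -> instruction tensor

def instruction_from_instance(token_ids):
    h = V_phi[token_ids].mean(axis=0)        # Eq. (4): avgpool over embedded tokens
    return np.einsum("ldk,k->ld", W_phi, h)  # Eq. (5): I_phi = W_phi h

z = np.array([3, 17, 42, 7])   # token ids of an "Input: ... Output: ..." string
I_phi = instruction_from_instance(z)
print(I_phi.shape)             # (l, d) -> (4, 8)
```

Because V_phi and W_phi are shared across tasks, the parameter count is independent of |T_train|, in contrast to the per-task matrices of direct parameterization.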

Instruction Extractor
We consider another type of instruction that has multiple candidates. A task exemplar is one example, because every task instance j ∈ {1, ..., N_t} in the training set can be used as a task exemplar. While instruction tuning randomly selects a task exemplar as the instruction, an optimal task exemplar should exist for cross-task generalization. We explore how to select the optimal task exemplar that maximizes the performance on unseen tasks. An outline of the instruction extractor is shown in Figure 2 (right).
We parameterize the probability p_ϕ(z_t^{(j)}) that the j-th instance is selected as an exemplar of task t. Similar to the instruction embedder, we consider the following two parameterizations.

Direct Parameterization (DP) We parameterize the logits of p_ϕ(z_t^{(j)}) by using a learnable vector v_t ∈ R^{N_t} for each task t. The logits are converted into probabilities using the softmax function in Eq. (6):

$$p_\phi(z_t^{(j)}) = \mathrm{softmax}_j(v_t) \quad (6)$$
This parameterization is simple but not scalable when the number of training tasks is large.
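A minimal sketch of Eq. (6), with an illustrative learned logit vector:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

# Direct parameterization (sketch): one learnable logit vector v_t per task,
# giving a distribution over that task's N_t candidate exemplars.
N_t = 5
v_t = np.array([0.1, 2.0, -1.0, 0.5, 0.0])   # illustrative logits
p = softmax(v_t)                              # Eq. (6)
print(p.argmax())                             # index of the preferred exemplar -> 1
```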
Instance Conversion (IC) While direct parameterization parameterizes p_ϕ(z_t^{(j)}) regardless of the task instance (i.e., the task input and output), instance conversion considers the conditional probability given a task instance. Specifically, instance conversion parameterizes the probability that z_t^{(j)} is selected as the exemplar of instance z_t^{(i)} in Eq. (7):

$$p_\phi(z_t^{(j)} \mid z_t^{(i)}) = \mathrm{softmax}_j\big(h_t^{(i)\top} W_\phi h_t^{(j)}\big) \quad (7)$$
where W_ϕ ∈ R^{d′×d′} denotes a learnable matrix, and h_t^{(i)} is obtained by Eq. (4). This parameterization assumes that V_ϕ and W_ϕ are optimized to select an optimal exemplar given a task instance. As the parameters ϕ are shared across all training tasks, this parameterization is also scalable to a large number of training tasks.
Subsequently, the instance with the highest probability is extracted as the instruction, as shown in Eq. (8) and (9):

$$j^* = \arg\max_j p_\phi(z_t^{(j)}) \quad (8)$$
$$I_\phi = V_\theta[z_t^{(j^*)}] \quad (9)$$
where V_θ ∈ R^{v×d} is the word embedding matrix of the model θ. Since the argmax operation is not differentiable, we use the straight-through estimator (Bengio et al., 2013) to approximate the gradient in the backward pass². As computing the probability of all instances requires a high computational cost when the number of instances is large, we set a constant value N_t = N and randomly sampled N instances from all training instances.
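A minimal numpy sketch of the straight-through selection: the forward pass uses a hard one-hot argmax, while in an autograd framework the backward pass would reuse the softmax gradient. numpy has no autograd, so the comment only indicates where that substitution happens; the function name is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Straight-through argmax (sketch): forward uses a hard one-hot selection;
# backward would pretend the selection was the softmax, so gradients reach
# the logits. In e.g. PyTorch one writes: hard + (p - p.detach()).
def st_select(logits, exemplar_embeds):
    p = softmax(logits)
    hard = np.zeros_like(p)
    hard[p.argmax()] = 1.0          # Eq. (8): pick the most probable exemplar
    selected = hard @ exemplar_embeds   # Eq. (9): its embedding becomes I_phi
    return selected, p

rng = np.random.default_rng(0)
E = rng.normal(size=(3, 4))         # 3 candidate exemplars, embedding dim 4
sel, p = st_select(np.array([0.2, 1.5, -0.3]), E)
print(np.allclose(sel, E[1]))       # hard forward pass picks the argmax row
```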
2 We also tried to compute I ϕ using the expectation of z (j)

Efficiently Solving Bilevel Optimization
Directly solving the bilevel optimization requires substantial computational cost because it involves a nested formulation. As shown in Alg. 1, approximating the inner optimization in Eq. (2) by K gradient steps significantly reduces the computational cost, where K is large enough to reach the optimal points of the inner loop (Franceschi et al., 2017; Shaban et al., 2019).
Computing the hypergradient ∇_ϕ L_out(θ^{(K)}) still requires a large memory space O(K|θ| + |ϕ|), as it needs to store K-step gradients (Franceschi et al., 2017), and the language model θ contains many parameters. Using the implicit function theorem in Eq. (10) and (11), the hypergradient can be computed without storing the intermediate gradients (Bengio, 2000; Lorraine et al., 2020):

$$\nabla_\phi \mathcal{L}_{\mathrm{out}}(\theta^*) = \frac{\partial \mathcal{L}_{\mathrm{out}}}{\partial \theta^*} \frac{\partial \theta^*(\phi)}{\partial \phi} \quad (10)$$
$$\frac{\partial \theta^*(\phi)}{\partial \phi} = -\left[\frac{\partial^2 \mathcal{L}_{\mathrm{in}}}{\partial \theta \partial \theta}\right]^{-1} \frac{\partial^2 \mathcal{L}_{\mathrm{in}}}{\partial \theta \partial \phi} \quad (11)$$
However, it is impractical to compute the inverse of the Hessian matrix in Eq. (11), as exactly inverting the Hessian often requires O(|θ|³) computational cost. We thus approximate the inverse Hessian using the Neumann approximation, which has been introduced in hyperparameter optimization (Lorraine et al., 2020; Zhang et al., 2021). The inverse of the Hessian matrix can be approximated as shown in Eq. (12).
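As a quick numeric sanity check, the Neumann-series inverse-Hessian-vector product can be sketched on a tiny explicit Hessian (all values illustrative; in practice the Hessian is never materialized, only Hessian-vector products are used):

```python
import numpy as np

# Neumann-series inverse-Hessian-vector product (sketch). For a PSD Hessian H
# with ||I - gamma*H|| < 1, H^{-1} v ~= gamma * sum_k (I - gamma*H)^k v,
# computed with matrix-vector products only (no explicit inversion).
def neumann_ihvp(H, v, gamma=0.1, K=200):
    acc, cur = np.zeros_like(v), v.copy()
    for _ in range(K):
        acc += cur                       # accumulate (I - gamma*H)^k v
        cur = cur - gamma * (H @ cur)    # apply (I - gamma*H) once more
    return gamma * acc

H = np.array([[2.0, 0.3], [0.3, 1.5]])   # small illustrative PSD Hessian
v = np.array([1.0, -1.0])
approx = neumann_ihvp(H, v)
exact = np.linalg.solve(H, v)
print(np.allclose(approx, exact, atol=1e-6))  # True for this H and gamma
```

Each iteration costs one Hessian-vector product, which autodiff frameworks can compute in O(|θ|) memory, giving the reduced cost cited below.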

$$\left[\frac{\partial^2 \mathcal{L}_{\mathrm{in}}(\theta, \phi)}{\partial \theta \partial \theta}\right]^{-1} \approx \gamma \sum_{k=0}^{K'} \left[E - \gamma \frac{\partial^2 \mathcal{L}_{\mathrm{in}}(\theta, \phi)}{\partial \theta \partial \theta}\right]^k \quad (12)$$

where E denotes an identity matrix, the series is truncated at K′ terms, and γ ∈ R is sufficiently small to satisfy ∥E − γ ∂²L_in(θ, ϕ)/∂θ∂θ∥ < 1 in the operator norm. Consequently, the computational cost of the hypergradient considerably decreases to O(|θ| + |ϕ|), as shown in Lorraine et al. (2020).

Experiments

Dataset In this experiment, we used SUPER-NATURALINSTRUCTIONS (SUP-NATINST; Wang et al., 2022) as a benchmark to measure cross-task generalization. SUP-NATINST consists of over 1,600 diverse tasks and their instructions across multiple languages. We used the English tasks and their instructions, resulting in 876 tasks in total. We used the same test split of tasks (12 types; 119 tasks) and 100 instances for each task as Wang et al. (2022). The remaining 60 task types (757 tasks) were used for the meta-train, meta-test, and validation sets. The validation set consisted of 10 instances across all 757 tasks, which were used to determine hyperparameters, including the meta-train/test split. Based on the validation performance, we split the 60 task types into 50 and 10 types, which were used for the meta-train and meta-test sets, respectively. We used 100 instances of each task for the meta-train/test sets. Table 1 summarizes the statistics for each split. The task types in each split are listed in Appendix A.1.

Evaluation & Baselines
We assessed cross-task generalization in two settings: a zero-shot setting that uses the task definition as the testing instruction, and a one-shot setting that uses a task exemplar (n = 1) as the testing instruction. We adopted ROUGE-L (Lin, 2004) to evaluate all tasks; Wang et al. (2022) show that human evaluation results align quite well with ROUGE-L across a variety of tasks.
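For reference, ROUGE-L is an F-measure over the longest common subsequence of candidate and reference tokens. A minimal sketch (real implementations also handle tokenization and stemming details):

```python
# ROUGE-L via longest common subsequence (minimal sketch of the metric).
def lcs_len(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)   # F1 of LCS precision and recall

print(rouge_l("the film is fun", "this film is fun"))  # LCS = 3 tokens -> 0.75
```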
As baseline training instructions, we used manually created instructions (e.g., the task definition) and exemplars randomly selected for each task or each instance. These were compared with the learnable instructions induced by the instruction embedder and the optimal exemplars selected by the instruction extractor.
Implementation Details In our experiments, we used pretrained T5 (Raffel et al., 2020) as the model θ. Specifically, we used the LM-adapted version of the original T5-base (220M), which is further trained with a language modeling objective (Lester et al., 2021). The hyperparameters of the model θ were tuned based on the validation performance of instruction tuning (the baselines), and the same hyperparameters were used for instruction optimization. The hyperparameters of the learnable instructions ϕ were determined w.r.t. the validation performance of instruction optimization. Further details are provided in Appendix A.2.

Proof of Concept
Before moving on to the comparison with instruction tuning, we show that our instruction extractor successfully optimizes the training instruction. We trained models with two types of training instructions: one is a task exemplar, and the other is blank text. Then, we evaluated them on the test set, where a task exemplar is used as the testing instruction. As shown in Figure 3 (left), the model trained with a task exemplar achieves nearly 40% ROUGE-L (black), whereas the model trained with blank text declines significantly to approximately 20% ROUGE-L (gray). Given these preliminary results, we verified that our instruction extractor appropriately selects a task exemplar from the two training instructions and obtains sufficient generalization ability. Figure 3 (left) shows that our instruction extractor achieves performance competitive with the model trained with a task exemplar. Specifically, instance conversion (IC; blue) converges faster than direct parameterization (DP; light blue). Figure 3 (right) presents the percentage of training instances where a task exemplar is selected as the training instruction. For DP, the percentage increases smoothly but saturates at approximately 50%. In contrast, IC reaches almost 100%, though the increase is slightly unstable. These results indicate that our instruction extractor successfully selects an appropriate training instruction. Note that the training time of instruction optimization is reasonable compared to instruction tuning, as shown in Appendix A.3.

Main Results
Here, we examine the effectiveness of instruction optimization by comparing it with the baselines. In Tables 2 and 3, we show the average performance across 8 different random seeds and 95% confidence intervals w.r.t. the t-distribution.
Table 2 shows the average ROUGE-L across all test tasks where the task definition is used as the testing instruction, while the training instruction is varied. As baseline training instructions, we used manually created task definitions concatenated with positive/negative exemplars and explanations of each positive/negative exemplar. When using only the learnable instructions generated by the instruction embedder, the performance is considerably worse than that of the baselines. This underperformance suggests that the learned instructions cannot substitute for manually created instructions. However, concatenating the learnable instruction with the task definition leads to a performance gain, whereas prepending other instructions (positive/negative exemplars and explanations) has a negative effect. As elaborated in Section 5.1, adding learnable instructions improves the diversity of instructions and achieves higher generalization performance.
In Table 3, we show the results where a task exemplar is used as the testing instruction. Unfortunately, our instruction extractor underperforms exemplars randomly selected for each task (i.e., the same exemplar is used for every instance of a task). To investigate the reason for this, we added another baseline that randomly selects an exemplar for each instance (i.e., different exemplars are used for each instance). Unexpectedly, random exemplars yield considerably worse ROUGE-L when they are selected for each instance. This result indicates that using the same exemplar across all instances of each task is preferable for cross-task generalization. As the instruction extractor (DP and IC) updates the optimal exemplar during optimization, it performs worse than exemplars randomly selected for each task. In particular, as IC varies the optimal exemplar for each instance, it results in lower performance.
The evaluation results of each test task type are shown in Appendix A.4.

Analysis of Learned Instruction
We discuss how the learned instruction contributes to the improvement of cross-task generalization.
As the instruction embedder directly generates instruction embeddings in a continuous space, the learned instructions are difficult to interpret. Following Lester et al. (2021), we computed the nearest neighbors of each token in the learned instruction from the vocabulary of the model θ; however, we could not find explicit patterns in the nearest tokens. Therefore, we computed the embeddings of the learned instructions and visualized them in a two-dimensional space using t-SNE (Van der Maaten and Hinton, 2008). The embeddings were obtained by average pooling across the last hidden states encoded by the T5 encoder.
In Figure 4, we show the embeddings of the top 20 task types with respect to the number of tasks in the meta-train set. The embeddings of the task definitions (left) are closely clustered by task type, and the training tasks leave some regions of the space uncovered. On the other hand, the embeddings of the learned instructions (right) are only roughly clustered, and some task types are scattered over the embedding space (e.g., sentiment analysis and toxic language detection). As the learned instructions enhance the diversity of instructions and cover a broader embedding space, the trained model can generalize to a wider variety of instructions. Thus, the learned instructions improve the generalization performance on unseen tasks.
Figure 5 shows the generalization performance with respect to the length of the learnable instruction prepended to the task definition. The model's performance saturates when the length is 2^6 = 64. When the instruction is longer than 64, the performance declines significantly. As bilevel optimization tends to be unstable for large-scale hyperparameters, a large instruction length leads to low generalization performance.

Analysis of Meta-train/test Split
We study how meta-train/test split affects the generalization performance of the trained model.
Number of Meta-train/test Tasks Figure 6 shows the performance with different numbers of task types in the meta-train/test split: 1/59, 10/50, 20/40, 30/30, 40/20, 50/10, and 59/1. In each split, the meta-train/test tasks were randomly chosen. The trained model achieves the best generalization performance when the number of task types in the meta-test set is 10. The performance worsens as the number of meta-test tasks increases, while the number of meta-train tasks decreases correspondingly.
Diverse vs. Not Diverse We examine whether the meta-test tasks should be diverse. If the meta-test tasks are diverse, the generalization performance should improve, because the instruction is trained to achieve higher performance on a variety of tasks. However, diversity also increases the risk that some meta-test tasks are similar to meta-train tasks, which would negatively affect the performance on unseen tasks. It is therefore not obvious whether the meta-test tasks should be diverse.
To answer this question, we prepared two types of meta-test splits. One comprises randomly selected tasks, whereas the other consists of tasks grouped by k-means clustering. We prepared 16 different random splits, while k-means divided the tasks into 16 groups based on the embeddings of the task definitions. Then, for both the random split and k-means, the split that performed best on the validation set was chosen from the 16 splits. Experimental results show that the model trained on the random split achieves 36.1 ROUGE-L, while that of k-means scores 35.0 ROUGE-L on the test set. Although the margin is not large, we confirmed that diverse meta-test tasks are preferable for cross-task generalization.

Related Work
Instruction Tuning Instruction tuning has attracted considerable attention as a way to build models that generalize across a variety of tasks (Wei et al., 2022; Sanh et al., 2022; Mishra et al., 2022). By prepending either a few exemplars (Min et al., 2022b; Chen et al., 2022) or text-based instructions (Wei et al., 2022; Sanh et al., 2022; Mishra et al., 2022) in multi-task learning, the trained model can generalize to tasks unseen during training. Further progress has been made by scaling the number of tasks (Wang et al., 2022; Chung et al., 2022), scaling the model size (Chung et al., 2022; Scao et al., 2022), and improving the training strategy (Lang et al., 2022; Min et al., 2022a; Ye et al., 2023). In contrast, our work is the first study to optimize training instructions to improve cross-task generalization ability.
Although SUPER-NATURALINSTRUCTIONS (Wang et al., 2022) is used as the benchmark for measuring cross-task generalization in our study, our instruction optimization can be applied to other cross-task benchmarks, such as CROSSFIT (Ye et al., 2021) and PromptSource (Bach et al., 2022).
Prompt Engineering Recent instruction-based NLP has evolved prompt engineering, which seeks the most appropriate prompt to achieve a task (Liu et al., 2022b). While numerous studies search for an optimal prompt in a discrete token space (Shin et al., 2020; Schick and Schütze, 2021; Gao et al., 2021), some work studies continuous prompts that perform prompting in the embedding space of tokens (Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021). Other studies retrieve appropriate exemplars as a testing prompt for in-context learning and achieve better performance than randomly selected exemplars (Das et al., 2021; Liu et al., 2022a; Rubin et al., 2022). Whereas the aforementioned methods optimize prompts to achieve an individual task at test time, our study differs in the target and aim of optimization; we optimize the training prompts to maximize the generalization performance of the trained model.
Bilevel Optimization Bilevel optimization has been used to optimize hyperparameters (Franceschi et al., 2017; Lorraine et al., 2020), initial model weights (Finn et al., 2017; Franceschi et al., 2018), and model architectures (Liu et al., 2018; Zhang et al., 2021). We optimize training instructions by regarding them as a special type of hyperparameter. Learnable instructions are constructed from many hyperparameters, which makes bilevel optimization difficult in terms of computational cost and stability. Recent studies (Rajeswaran et al., 2019; Lorraine et al., 2020; Zhang et al., 2021) significantly reduce the computational cost and improve stability by combining the implicit function theorem with efficient inverse-Hessian approximations. We leverage this idea, achieving instruction optimization at a reasonable computational cost and stability.

Conclusion
This study presented instruction optimization, which optimizes training instructions with respect to generalization ability. The experimental results showed that our instruction extractor successfully extracted appropriate instructions, providing a proof of concept. In comparison with instruction tuning, the instruction embedder enhanced the diversity of instructions and improved the generalization ability compared to using only manually created instructions. In contrast, the instruction extractor did not contribute to a performance gain, because using the same task exemplar across instances is unexpectedly preferable for cross-task generalization. This study provides a basis for exploring the optimal instructions for instruction tuning.

Limitations
Our study used T5-base (220M) due to the capacity of our computational resources (Tesla V100 32GB). Thus, it is unclear whether our method is also effective for larger models, such as T5-XL/XXL. Lester et al. (2021) argue that continuous prompts are particularly effective for large T5 models. Following their results, our instruction embedder is also expected to be effective for larger models.
As shown in Figure 3, instruction optimization is slightly unstable in convergence. Some studies tackled the unstable convergence of bilevel optimization with L2 normalization, early stopping (Zela et al., 2019), or perturbation of hyperparameters (Chen and Hsieh, 2020). These methods might be effective in stabilizing instruction optimization.

Figure 3: Left: ROUGE-L on test tasks where a task exemplar is used as the testing instruction, while the training instruction is varied as above. Right: the percentage of training instances where a task exemplar is used as the training instruction.

Figure 4: Embeddings of the instructions in the meta-train set. Left: task definition; Right: learned instruction concatenated with the task definition. Each point represents a task, and each color denotes the task type.

Figure 5: ROUGE-L on the test set where the length of the learnable instruction is varied.

Figure 6: ROUGE-L on the test set w.r.t. the number of task types in the meta-test set.

Table 1: Statistics of the dataset.