Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning

Recent works on instruction tuning (IT) have achieved strong performance with zero-shot generalization to unseen tasks. With additional context (e.g., task definitions, examples) provided to models for fine-tuning, these models achieve much higher performance than untuned models. Despite the impressive performance gains, what models learn from IT remains understudied. In this work, we analyze how models utilize instructions during IT by comparing model training with altered vs. original instructions. Specifically, we create simplified task definitions that remove all semantic components and leave only the output space information, and delusive examples that contain incorrect input-output mappings. Our experiments show that models trained on simplified task definitions or delusive examples achieve performance comparable to models trained on the original instructions and examples. Furthermore, we introduce a random baseline for zero-shot classification tasks and find that it achieves similar performance (42.6% exact match) to IT (43% exact match) in the low-resource setting, while both methods significantly outperform the untuned T5 model (30% exact match). Our analysis provides evidence that the impressive performance gains of current IT models can come from picking up superficial patterns, such as learning the output format and guessing. Our study highlights the urgent need for more reliable IT methods and evaluation.


Introduction
Recently, instruction tuning (IT) has drawn much attention in the NLP community, with the rapid growth of new models (Sanh et al., 2021; Wei et al., 2021; Ouyang et al., 2022) and datasets (Wang et al., 2022; Gupta et al., 2022; Finlayson et al., 2022; Mishra et al., 2021; Ye et al., 2021; Bach et al., 2022). Models trained with task instructions demonstrate impressive zero-shot cross-task generalization ability. Despite the remarkable results, how models utilize the instructions during training and inference remains an open question.
Prior works have raised the question of whether models really learn to follow instructions or just capture spurious correlations. Jang et al. (2022) and Webson and Pavlick (2021) showed that current large language models (LLMs) can achieve similar performance with misleading instructions (prompts) in in-context learning (ICL) and few-shot learning scenarios. Min et al. (2022) analyzed how models utilize examples in ICL and observed that (1) the input-output mapping in examples is not important and (2) output space information is crucial.
Beyond ICL and few-shot prompt tuning, some works raise concerns about instruction following in the instruction tuning setting (Finlayson et al., 2022; Gupta et al., 2022; Gu et al., 2022), with a focus on test-time analysis. In contrast, we focus on analyzing how models utilize instructions during the training process. We compare our analysis methods and observations with prior work in Section 5.
In this work, we conduct controlled experiments on NatInst-V2 (Wang et al., 2022), the largest open-source instruction learning dataset, which includes 800+ English tasks with diverse task types, to study how models utilize instructions during IT. We strategically alter the instructions and compare them with the original instructions for IT. Specifically, for task definitions, we create simplified versions by removing all semantic components in the instructions and leaving only the output space information. For task examples, we create delusive examples with incorrect input-output mappings, where the examples' input and output spaces are correct but the input-output mappings are wrong. Figure 1 shows specific examples of these altered instructions.
[Figure 1: Examples of altered instructions. Simplified task definitions remove everything except the output space information, and Empty definitions remove the task definition entirely; Delusive task examples change the example output to a wrong output, and Empty examples remove the task example entirely.]

Our experiments show that models trained with simplified task definitions achieve performance on par with the original IT models across training set sizes ranging from 10 to 800 examples per task. We also observe that instruction-tuned models are sensitive to the input-output mapping during the test-time ICL stage, but not during the instruction-tuning (training) stage, especially in low-resource settings (i.e., ≤ 50 training instances per task). To further understand why instruction tuning improves performance on zero-shot test tasks, we establish a random baseline that only knows the correct output format (label space) for classification and multiple-choice tasks. We discover that the random baseline achieves a 40% absolute accuracy improvement over an untuned model, almost comparable to IT, which brings a 48% absolute accuracy gain. Our results suggest that the impressive performance gains of IT may simply come from models learning superficial patterns, such as the output space and format. We suggest future research on IT more carefully analyze performance gains and benchmark against trivial baselines.

Analysis Method
Background. Instruction tuning aims to train models to follow instructions and achieve better zero-shot generalization to new tasks. Figure 1 illustrates the two-stage instruction tuning pipeline used in many IT models, such as T0 (Sanh et al., 2021), FLAN (Wei et al., 2021), and TK-Instruct (Wang et al., 2022). In the first stage, the models are trained on a set of training tasks with instructions (task definitions and task examples). After training, the models are evaluated on a set of unseen test tasks for zero-shot generalizability. By incorporating instructions during training, the models are shown to significantly improve performance over untuned models. These impressive performance gains have led people to believe that models learn to follow instructions via instruction tuning. The goal of our analysis is to verify this belief.
Task definition manipulation. To analyze whether models really "understand" and utilize the semantic meaning of task definitions, we conduct controlled experiments that remove semantic information from task definitions. Specifically, we perform instruction tuning with task definitions at three levels of granularity: Original, Simplified, and Empty. The Original version uses the human-crafted, human-readable task definitions provided in NatInst-V2 (Wang et al., 2022). The Simplified task definitions remove all semantic components from the original task definition and leave only the output space information: we provide only the possible output labels as task definitions for classification tasks, and completely remove task definitions for other tasks (mostly generative tasks) during IT. Figure 1 shows an example of a Simplified task definition; more details can be found in Appendix A.1. For Empty, we provide no task definition during instruction tuning.

Task example manipulation. Finlayson et al. (2022) show that by providing a few task examples, both humans and models can guess and perform a task. We thus design a controlled experiment to study whether models learn the input-output mapping from task examples. Specifically, we compare models trained with three types of task examples: Original, Delusive, and Empty. For Delusive examples, we sample negative examples from NatInst-V2, which have correct input and output formats but incorrect input-output mappings.
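As a concrete illustration of the Delusive manipulation, the following sketch (not the authors' released preprocessing code; field names are assumptions) breaks the input-output mapping by rotating outputs across demonstrations, whereas the paper samples negative examples from NatInst-V2:

```python
def delusive_examples(examples):
    """Keep the input and output spaces intact but break the input-output mapping.

    examples: list of {"input": ..., "output": ...} demonstration pairs.
    """
    outputs = [ex["output"] for ex in examples]
    shifted = outputs[1:] + outputs[:1]           # rotate outputs so every pair is mismatched
    return [{"input": ex["input"], "output": out}
            for ex, out in zip(examples, shifted)]

# Two demonstrations (drawn from different tasks, purely for illustration)
# whose outputs end up swapped: wrong mapping, but correct output space.
demos = [{"input": "throw : fly. aspire : ?", "output": "attain"},
         {"input": "Fact: pesticides cause pollution.", "output": "pollution harms."}]
print(delusive_examples(demos))
```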

Results
Task Definition Experiments. Figure 2 shows the experimental results for task definitions. In the top sub-figures, we see that models trained with Simplified definitions achieve almost the same results as models trained with Original definitions on both Classification and Generative tasks. Note that Simplified task definitions remove all semantic components and retain only output space information for Classification tasks, and remove task definitions altogether for Generative tasks. This indicates that models may utilize only output space information during instruction tuning. The bottom-left sub-figure of Figure 2 shows the overall ROUGE-L score for Classification tasks, where models trained on the Original task definitions slightly outperform the Simplified ones. A closer examination reveals that models trained on the Original task definitions are more likely to predict partially correct answers that raise the ROUGE-L score on some tasks; we provide further details in Appendix A.4.

Task Examples Experiments. Figure 3 shows the experimental results for task examples. The left sub-figure shows overall ROUGE-L scores: models trained with Delusive task examples achieve almost the same performance as those trained with Original task examples when the number of training instances per task is small (≤ 50). When the data per task increases to 200, the Original models start to slightly outperform the Delusive ones. Combined with the previous results for task definitions, we observe that compared to the untuned model (T5 w/o IT), IT models achieve significant performance gains (ROUGE-L from 22 to 46) with (1) Simplified task definitions and (2) Delusive task examples, indicating that the current impressive improvements of IT models can come from learning superficial patterns without utilizing (following) the instructions as humans do.

The right sub-figure shows the results when Delusive task examples are used at test time via in-context learning. Performance drops substantially for all three models, indicating that the input-output mapping is important for in-context learning with instruction-tuned models. This result seems to contradict prior work (Min et al., 2022), which found the input-output mapping unimportant for in-context learning on classification tasks. However, a closer investigation shows that most tasks with significant performance drops are analogical tasks rather than the classification tasks studied in Min et al. (2022).

Discussion
Random baseline. While our experiments suggest that models do not utilize most of the information in the instructions, we still observe large performance gains from instruction tuning. To understand where the gains come from, we introduce a Random baseline that simply guesses within the correct output space. Figure 4 shows the results. First, IT improves format correctness from 0.11% to 91% by training with only one instance per task, and accuracy improves from 0.08% to 40%. Providing more training instances per task (> 20) further improves accuracy to 48%. However, while these gains seem impressive, the Random baseline also achieves 40% accuracy. This suggests that the majority of the improvement from instruction tuning may come from the model learning the output format and guessing.

Related Analysis. Min et al. (2022) found that the input-output mapping in examples is irrelevant for in-context learning (ICL) on classification tasks. In contrast, we observe that it matters for ICL but is irrelevant to IT training on analogical generative tasks. Webson and Pavlick (2021) analyzed prompt-based models in few-shot learning scenarios and observed that models learn just as fast with irrelevant or misleading prompts, which aligns with our findings. For instruction tuning, prior works have raised concerns about models not following instructions. Gu et al. (2022) and Gupta et al. (2022) analyze how models utilize instructions by removing them during the inference stage; however, they do not address how models use instructions during training. Wei et al. (2021) and Wang et al. (2022) observe performance drops when removing task definitions during IT and conclude that task definitions are helpful, which we found to be true but only in terms of providing output space information.
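As a concrete illustration, the Random baseline described above amounts to uniformly guessing within each classification task's label set (a minimal sketch with hypothetical field names, not the exact evaluation code):

```python
import random

def random_baseline(label_space, num_test_instances, seed=0):
    """Guess uniformly at random within the task's output space (label set)."""
    rng = random.Random(seed)
    return [rng.choice(label_space) for _ in range(num_test_instances)]

def exact_match(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Hypothetical binary classification task: the output format is always correct,
# and guessing alone already yields roughly 50% exact match without reading the input.
preds = random_baseline(["1", "2"], num_test_instances=1000)
```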

Conclusion
In this work, we analyzed how models utilize instructions during IT using the current largest open-source IT dataset, NatInst-V2 (Wang et al., 2022). We constructed controlled experiments to compare model training with altered vs. original instructions (task definitions and examples). Our results suggest that current IT models do not fully utilize instructions, and that the impressive performance gains of IT may come from models learning superficial patterns, such as the output space and format. We suggest future research on instruction tuning more carefully analyze performance gains and benchmark against trivial baselines. We also look forward to future work conducting further analysis and proposing more reliable IT methods.
Limitations
While our analysis suggests that IT models do not fully utilize instructions but instead learn superficial patterns from them, our experiments have some limitations. First, we only analyze a state-of-the-art IT method on the NatInst-V2 dataset. Although Wang et al. (2022) showed that their model can outperform other large models such as InstructGPT (Ouyang et al., 2022) and T0 (Sanh et al., 2021), we did not analyze other IT methods, such as RLHF (Reinforcement Learning from Human Feedback) in InstructGPT. Second, since our analysis is conducted during the training stage, we cannot analyze private models such as ChatGPT. We also did not explore models larger than 770 million parameters due to computational resource limitations, which may cause us to miss emergent abilities of large language models (LLMs) (Wei et al., 2022). Lastly, while we observe that models do not utilize the majority of the instructions during IT, a certain degree of instruction understanding may already exist in pre-trained LLMs, which we did not study in this work. In conclusion, our work is a focused analysis that illuminates potential vulnerabilities of current IT models and evaluation metrics. We encourage future work to conduct more comprehensive studies on larger models and to propose more reliable IT methods and evaluation frameworks.

Ethical Considerations
We describe the computation resources and models used to conduct our experiments. All of our models run on four 48GB NVIDIA A6000 GPUs, with 48 TB of disk storage and an AMD EPYC 7413 24-core processor. The experiments take around 1,200 GPU hours on a single 48GB NVIDIA A6000 GPU. Our experiments do not require model or data parallelism. For the model, we use the HuggingFace T5-large-lm-adapt model in our experiments, and we will release our code once the paper is accepted.

A.1 Simplified Task Definition
To remove all semantic components and leave only the output space information in the task definition, we first manually look through all tasks to verify how each task definition describes its output space, and categorize all task definitions into four types: (1) Exact Mentioned, (2) Combined Mentioned, (3) Keyword Mentioned, and (4) No Mentioned. For Exact Mentioned, Combined Mentioned, and Keyword Mentioned, the original task definition contains a description of the output space. For No Mentioned, the original task definition does not directly describe the labels or keywords of the output space; this includes all generative tasks and a few classification tasks whose definitions do not describe output space information. Further details and examples are shown in Table 1.
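For concreteness, a minimal sketch of how a Simplified definition can be derived once a task's label set is known (a hypothetical helper, not the released preprocessing script); generative and No Mentioned tasks simply get an empty definition:

```python
def simplify_definition(output_labels):
    """Keep only output-space information from the task definition.

    output_labels: the task's label set for classification tasks (e.g., extracted
    from an Exact/Combined/Keyword Mentioned definition), or an empty list for
    generative / No Mentioned tasks.
    """
    if not output_labels:
        return ""                                   # definition removed entirely
    return " ".join(f"Label: {label}." for label in output_labels)

print(simplify_definition(["1", "2"]))              # -> "Label: 1. Label: 2."
print(simplify_definition([]))                      # -> ""
```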

A.2 Hyper-parameter tuning results
Before conducting our analysis, we follow the model settings in Wang et al. (2022) and perform a hyper-parameter search. Prior work trained the TK-Instruct (770M) model from T5-Large-lm-adapt (770M) with learning rate 1e-5, batch size 16, and 100 training instances per task for two epochs. We found that (1) a learning rate of 1e-4 converges faster while maintaining performance; (2) a higher batch size (≥ 128) leads to much lower loss and better performance; (3) more training instances per task (≥ 200) lead to better performance; and (4) the loss converges within 4 to 6 epochs. Following these results, we conducted our experiments with the following settings: learning rate 1e-4, batch size 128, [10, 20, 50, 200*, 800] training instances per task, and six training epochs. Our best result (200 instances) achieves a 52.8 ROUGE-L score, which is better than TK-Instruct-770M (48 ROUGE-L) from Wang et al. (2022) and comparable to their TK-Instruct-3B (54 ROUGE-L) model.
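For reference, the resulting setting roughly corresponds to the following configuration sketch (assuming the HuggingFace transformers Trainer; output path and batch split across GPUs are assumptions, not the exact training script):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

model_name = "google/t5-large-lm-adapt"          # 770M-parameter LM-adapted T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="tk-instruct-770m-repro",         # hypothetical output path
    learning_rate=1e-4,                          # converges faster than 1e-5, same performance
    per_device_train_batch_size=8,               # 4 GPUs x 8 x 4 accumulation = effective batch 128
    gradient_accumulation_steps=4,
    num_train_epochs=6,                          # loss converges within 4-6 epochs
    predict_with_generate=True,
)
```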

A.3 Analogical Tasks
We examine the models trained with Original task examples and identify the tasks with the largest performance drops (more than 20 points) when Delusive examples are used during testing (in-context learning). We show the list of tasks in Table 3 and some of their details in Table 2. These tasks have short input and output lengths, and their inputs and outputs have direct word-level relations.

A.4 Performance gap between ROUGE-L and exact match
In the Results section, we observed a slight performance gap on Classification tasks between models trained with Original and Simplified task definitions. By further examining the data, we found that this can happen on some Keyword Mentioned tasks described in Appendix A.1. Table 1 shows an example task of the Keyword Mentioned type. This task is a 7-class classification task with a special label "REFERENCE". Ground-truth answers in the "REFERENCE" class are combined with other text from the input, and both the Original and Simplified models struggle (0% exact match) to predict the correct answer for this class. However, while both models fail to predict exactly correct answers, the Original model produces better partially correct answers by simply predicting "REFERENCE" more often. Looking into the test set, we observe that 94 percent of the ground-truth answers belong to the "REFERENCE" class. Looking into the predictions, the Original model predicts "REFERENCE" 55 percent of the time while the Simplified model predicts it only 4 percent of the time, yielding a 33.8-point higher ROUGE-L score. We hypothesize that this happens because the word "reference" is explicitly mentioned numerous times (8) in the Original task definition while other labels are mentioned fewer than twice, leading to the Original model's tendency to predict "REFERENCE".
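To illustrate how a partially correct prediction can earn ROUGE-L credit while exact match stays at zero, consider the following small sketch (the answer strings are hypothetical; the rouge-score package is assumed):

```python
# Hypothetical gold/prediction strings illustrating the ROUGE-L vs. exact-match gap.
from rouge_score import rouge_scorer   # pip install rouge-score

gold = "REFERENCE the earlier statement"        # hypothetical multi-word gold answer
pred_original = "REFERENCE"                     # partial overlap with the gold answer
pred_simplified = "SUPPORTED"                   # hypothetical label with no overlap

scorer = rouge_scorer.RougeScorer(["rougeL"])
print(scorer.score(gold, pred_original)["rougeL"].fmeasure)    # 0.4: partial ROUGE-L credit
print(scorer.score(gold, pred_simplified)["rougeL"].fmeasure)  # 0.0
# Exact match is 0 for both predictions, so the gap only shows up in ROUGE-L.
```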

Table 1: We describe how we created the Simplified task definition from the Original task definition for the four task definition types: Exact Mentioned, Combined Mentioned, Keyword Mentioned, and No Mentioned. For each type, Description explains how the task definition provides the output space information; Original Definition shows an example task definition, retrieved from a real task in the NatInst-V2 dataset; Output Space describes the set of possible outputs; Simplified Definition shows how we simplified the Original task definition.

Exact Mentioned
Description: For tasks labeled as Exact Mentioned, the task definition describes the finite output space; all labels within the output space are directly written in the definition.
Original Definition: "Definition: In this task, you will be shown a short story with a beginning, two potential middles, and an ending. Your job is to choose the middle statement that makes the story incoherent / implausible by indicating 1 or 2 in the output. If both sentences are plausible, pick the one that makes less sense."
Output Space: finite set ["1", "2"]
Simplified Definition: "Label: 1. Label: 2."

Combined Mentioned
Description: For tasks labeled as Combined Mentioned, the task definition describes a set of keyword labels that construct an infinite output space with all possible combinations of these keyword labels.
Original Definition: "Given a command in a limited form of natural language, provide the correct sequence of actions that executes the command to thus navigate an agent in its environment."

Table 2: Details of some of the analogical tasks discussed in Appendix A.3.

task036_qasc_topic_word_to_generate_related_fact
Task Definition: In this task, you need to write a topic word from the given fact. The topic word must have at least one word overlap with the given fact. The topic word often involves adding a new word from a related concept. In your topic word, use at least one word from the given fact. Topic words with two or more words work best.
Task Example: Input: Fact: pesticides cause pollution. Output: pollution harms.

task1152_bard_analogical_reasoning_causation
Task Definition: Two analogies that relate actions with their consequences are given in the form "A : B. C : ?". The phrase "A : B" relates action A to consequence B. Your task is to replace the question mark (?) with the appropriate consquence of the given action C, following the "A : B" relation. Your answer should be a single verb, without further explanation.
Task Example: Input: throw : fly. aspire : ? Output: attain

task1159_bard_analogical_reasoning_containers
Task Definition: Two analogies that relate items to the associated containers is given in the form "A : B. C : ?". "A : B" relates item A to its associated container B. Your task is to replace the question mark (?) with the appropriate container for the given item C, following the "A : B" relation.