Symbol tuning improves in-context learning in language models

We present symbol tuning: finetuning language models on in-context input-label pairs where natural language labels (e.g., "positive/negative sentiment") are replaced with arbitrary symbols (e.g., "foo/bar"). Symbol tuning leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead do so by learning the input-label mappings. We experiment with symbol tuning across Flan-PaLM models up to 540B parameters and observe benefits across various settings. First, symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels. Second, symbol-tuned models are much stronger at algorithmic reasoning tasks, with up to 18.2% better performance on the List Functions benchmark and up to 15.3% better performance on the Simple Turing Concepts benchmark. Finally, symbol-tuned models show large improvements in following flipped labels presented in-context, meaning that they are more capable of using in-context information to override prior semantic knowledge.


INTRODUCTION
A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via few-shot examples given in-context (Brown et al., 2020; OpenAI, 2023, inter alia). Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering (Brown et al., 2020; Reynolds & McDonell, 2021) or phrasing tasks as instructions (Wei et al., 2022a; Ouyang et al., 2022; Sanh et al., 2022, inter alia), and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown in-context exemplars with random labels (Min et al., 2022b) or flipped labels.
In this paper, we propose a simple finetuning procedure that we call symbol tuning, which significantly improves the ability of language models to reason with and learn from input-label mappings presented in-context. In the symbol-tuning procedure, we finetune language models on input-label pairs presented in-context where natural language labels are remapped to arbitrary symbols. 1 The intuition is that when models cannot rely on instructions or relevant natural language labels to figure out a given task, they must instead do so by reasoning with input-label mappings in-context in order to learn the mappings that reveal the task. We perform symbol tuning using a mixture of 22 NLP datasets with various arbitrary symbols as labels and experiment using several Flan-PaLM models (Chung et al., 2022; 8B, 62B, 62B-cont, and 540B).
First, symbol tuning improves performance of baseline models on unseen in-context learning tasks across various settings (with/without instructions, with/without relevant labels), with larger performance gains when instructions or natural language labels are not given in the prompt. For example, when prompts do not contain instructions or relevant labels, symbol tuning yields a +11.1% average performance improvement across eleven evaluation tasks for Flan-cont-PaLM-62B.
Second, symbol-tuned models are better at algorithmic reasoning tasks, a striking result since the symbol-tuning procedure included only natural language data and no numerical or algorithmic data. On a set of reasoning evaluation suites for list functions (e.g., remove the last element in a list), symbol-tuned models experience performance improvements of +18.2% for Flan-PaLM-8B, +11.1% for Flan-PaLM-62B, and +3.6% for Flan-PaLM-540B. On a set of turing concept tasks (e.g., swapping 0s and 1s in a string), symbol-tuned models also improve by +15.3% for Flan-PaLM-8B and Flan-PaLM-62B and +4.7% for Flan-PaLM-540B.
Additionally, we experiment on an in-context learning setting where inputs have flipped labels, which forces the model to override its prior knowledge when presented with contradictory information in-context. Pretrained language models have the ability to somewhat follow flipped labels; this ability is lost during instruction tuning but can be restored via symbol tuning.
Finally, we conduct ablation studies demonstrating that symbol tuning is simple to implement and only requires a relatively small amount of compute. Symbol tuning does not require mixing instruction-tuning data or collecting a large number of datasets, and only 1k to 2k steps of tuning are needed to get its benefits. Overall, we hope that the strong empirical results from symbol tuning encourage further work in allowing language models to reason over arbitrary symbols given in-context.

SYMBOL TUNING
Despite their ability to perform some reasoning tasks after being shown in-context exemplars (Brown et al., 2020; OpenAI, 2023), language models are still sensitive to the way in which these tasks are presented in prompts (Brown et al., 2020; Reynolds & McDonell, 2021; Wei et al., 2022a), suggesting that they are not reasoning in a robust way. Instruction tuning has been shown to improve performance and allow models to better follow in-context exemplars (Min et al., 2022a; Wei et al., 2022a; Ye et al., 2021). One shortcoming, however, is that models are not forced to learn to use the exemplars because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, in the left-hand side of Figure 1, although the exemplars can help the model understand the task, they are not strictly necessary since the model could ignore the exemplars and just read the instruction.
To make the model better at in-context learning, we propose symbol tuning, in which the model is finetuned on exemplars where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., "Foo," "Bar," etc.). In this setup, the task is unclear without looking at the in-context exemplars. For example, if the prompt from the previous paragraph were changed to "<sentence>. Answer: {Foo, Bar}" (as shown in the right-hand side of Figure 1), multiple in-context exemplars would be needed in order to figure out the task. Because symbol tuning teaches the model to reason over the in-context exemplars, symbol-tuned models should perform much better on unseen tasks that require reasoning between in-context exemplars and their labels.

Figure 2 shows the 22 publicly-available NLP datasets from HuggingFace (Lhoest et al., 2021) (see Appendix B.1 for dataset details) that we use for our symbol-tuning procedure (we ablate the number of datasets used for symbol tuning in Section 7.3). We selected NLP tasks that have been widely used in the literature (Wang et al., 2018). Each dataset is categorized into one of seven task types; we selected only classification-type tasks because symbol tuning requires discrete labels. For each dataset, we use examples from the training split to compose prompts that we use for tuning. Each prompt uses a randomly-selected input-label format (formats are shown in Appendix C.2) and contains a randomly-selected number of in-context exemplars per class (between 2 and 10). We remap labels to a randomly-selected label from a set of ∼30k labels spanning three label types, as shown in Figure 3 (we ablate the number of labels in Appendix A.6 and the label types in Appendix A.7). Examples of generated tuning prompts for each task are shown in Appendix E.1.
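The remapping step above can be sketched as follows (a minimal sketch; the function name, prompt format, and toy data are ours, not the paper's implementation):

```python
import random

def compose_symbol_tuning_prompt(examples_by_class, symbol_pool, n_per_class, rng):
    """Remap each class label to a random arbitrary symbol and format exemplars.

    examples_by_class: dict mapping an original label to a list of input texts.
    symbol_pool: list of arbitrary symbols (words, character strings, integers).
    """
    # Draw one distinct arbitrary symbol per class (e.g., "foo", "bar").
    symbols = rng.sample(symbol_pool, k=len(examples_by_class))
    label_map = dict(zip(examples_by_class, symbols))

    lines = []
    for label, inputs in examples_by_class.items():
        for text in rng.sample(inputs, k=n_per_class):
            # No instruction and no natural language label: the task is only
            # recoverable from the input-label mappings themselves.
            lines.append(f"Input: {text}\nOutput: {label_map[label]}")
    rng.shuffle(lines)
    return "\n\n".join(lines), label_map

rng = random.Random(0)
examples = {
    "positive": ["great movie", "loved it", "a delight"],
    "negative": ["terrible plot", "hated it", "a bore"],
}
prompt, label_map = compose_symbol_tuning_prompt(
    examples, ["foo", "bar", "baz", "10452"], n_per_class=2, rng=rng)
```

Because the labels are arbitrary symbols, nothing in the resulting prompt reveals that the underlying task is sentiment classification; the mapping must be inferred from the exemplars.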
We generate prompts for the four different in-context learning (ICL) settings described in Figure 4; each setting either contains or does not contain instructions describing the task (see Appendix B.2 for the instructions we use for each task) and does or does not contain relevant natural language labels. For settings that do not use relevant natural language labels, we remap original labels to a randomly-selected label from a set of approximately 270k semantically-unrelated labels as shown in Figure 3 (we removed labels that were seen during symbol tuning). Examples of generated evaluation prompts for each task are shown in Appendix E.2.

Figure 3: We use a set of ∼300k arbitrary symbols from three categories: integers (5 digits), character combinations, and words (drawn from MIT word lists of 10,000 and 100,000 words). ∼30k symbols are used during finetuning and the remaining ∼270k are held out for evaluation. See Appendix C.1 for more details on the symbols that we used.

MODELS & FINETUNING PROCEDURE
For our experiments, we tune Flan-PaLM (Chung et al., 2022), the instruction-tuned variants of PaLM. We use instruction-tuned variants in order to reduce the number of steps needed for tuning, since symbol tuning an instruction-tuned model does not require relearning the information learned during the original round of instruction tuning. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (PaLM-62B trained for 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.
Our symbol-tuning pipeline mixes all datasets and randomly samples from each dataset. To ensure that the dataset sizes are balanced (i.e., no dataset gets completely overshadowed), we limit the number of training examples per dataset to a maximum of 25k randomly-selected examples. Training examples are combined into a single sequence using packing (Raffel et al., 2020), and inputs are separated from labels using an end-of-sequence (EOS) token. We tune all models using a batch size of 32 and the Adafactor optimizer (Shazeer & Stern, 2018). For 8B and 62B models, we tune with a learning rate of 3 × 10⁻³, and we tune Flan-PaLM-540B with a learning rate of 1 × 10⁻³. We use 2048 and 512, respectively, as the input and target sequence lengths during tuning.
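The capping and packing steps described above can be sketched as follows (a minimal sketch with our own function names and a toy whitespace tokenizer; the real pipeline operates on model tokens):

```python
import random

MAX_PER_DATASET = 25_000  # per-dataset cap described above

def build_pool(datasets, rng):
    """Cap each dataset at 25k examples, then shuffle the combined pool."""
    pool = []
    for examples in datasets.values():
        if len(examples) > MAX_PER_DATASET:
            examples = rng.sample(examples, MAX_PER_DATASET)
        pool.extend(examples)
    rng.shuffle(pool)
    return pool

def pack(pool, max_len, tokenize):
    """Greedily pack consecutive (input, label) pairs into sequences of at
    most max_len tokens, approximating packing (Raffel et al., 2020)."""
    sequences, current, used = [], [], 0
    for inp, label in pool:
        n = len(tokenize(inp)) + len(tokenize(label)) + 1  # +1 for separator
        if current and used + n > max_len:
            sequences.append(current)
            current, used = [], 0
        current.append((inp, label))
        used += n
    if current:
        sequences.append(current)
    return sequences

rng = random.Random(0)
pool = build_pool({"a": [("x y", "foo")] * 5, "b": [("p q r", "bar")] * 5}, rng)
packed = pack(pool, max_len=16, tokenize=str.split)
```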
Symbol tuning for 1k steps on a TPUv4 (Jouppi et al., 2023) requires approximately 16 minutes with 64 chips for Flan-PaLM-8B, 70 minutes with 128 chips for Flan-PaLM-62B, and 6 hours with 512 chips for Flan-PaLM-540B. For 8B and 62B model evaluations, we report results from the checkpoint after tuning for 4k steps, and for 540B model evaluations, we report results from the checkpoint after tuning for 1k steps (we ablate the number of tuning steps in Section 7.1). See Appendix C.3 for the number of finetuning steps, learning rate, batch size, and dropout used for each model. As a baseline, we compare our symbol-tuned models against the instruction-tuned models from Chung et al. (2022), and we also compare symbol tuning against continued instruction tuning in Appendix A.1.

SYMBOL-TUNED MODELS ARE BETTER IN-CONTEXT LEARNERS
In the symbol-tuning procedure, models must learn to reason with in-context exemplars in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from natural language labels or instructions. Symbol-tuned models should thus perform better in settings where tasks are unclear and require reasoning between in-context exemplars and their labels. Additionally, since symbol tuning is meant to improve the ability to follow in-context exemplars, it should not modify prior knowledge and should thus retain the same performance in settings where exemplars are not as necessary to complete the task.
To explore these settings, we define four ICL settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels), as shown in Figure 4. The easiest of these settings uses prompts where both instructions and relevant labels are available (as in-context exemplars are not necessary to learn the task), while the hardest setting uses prompts where instructions and relevant labels are both unavailable.

Figure 4: Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context exemplars. When these features are not available, models must reason with the given in-context exemplars in order to successfully perform the task. When they are available, reasoning with exemplars can help but is not necessary.
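The settings can be illustrated with a toy prompt formatter (a hypothetical sketch; `format_prompt` and the example texts are ours): keeping both the instruction and relevant labels yields the easiest setting, while dropping the instruction and remapping labels to arbitrary symbols yields the hardest.

```python
def format_prompt(exemplars, query, instruction=None, label_map=None):
    """Build an ICL prompt; exemplars are (text, natural_language_label) pairs."""
    remap = (lambda y: label_map[y]) if label_map else (lambda y: y)
    parts = [instruction] if instruction else []
    parts += [f"{x}\nAnswer: {remap(y)}" for x, y in exemplars]
    parts.append(f"{query}\nAnswer:")
    return "\n\n".join(parts)

exemplars = [("A truly fun film.", "positive"), ("A dull mess.", "negative")]
# Easiest setting: instruction present, relevant natural language labels kept.
easiest = format_prompt(exemplars, "I loved it.",
                        instruction="Classify the sentiment of each sentence.")
# Hardest setting: no instruction, labels remapped to arbitrary symbols.
hardest = format_prompt(exemplars, "I loved it.",
                        label_map={"positive": "foo", "negative": "bar"})
```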
In Table 1, we evaluate model performance before and after symbol tuning in each of these settings. We find that symbol tuning improves performance across all ICL settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on learning input-label mappings from exemplars (effectively saving ∼10x inference compute).
Symbol-tuned models also perform roughly comparably in settings with only relevant labels or only instructions, unlike baseline models, whose performance in settings with only relevant labels is always better than in settings with only instructions. Performance in settings with relevant labels actually decreases for Flan-PaLM-8B after symbol tuning, however, which may suggest that symbol tuning a small model can override its prior knowledge due to overfitting. Overall, the improvements demonstrate the strong potential of symbol tuning to improve model performance, especially when tasks are not clear and require learning from in-context exemplars.

Table 1: Average performance on eleven tasks in each ICL setting (with/without instructions and with/without relevant labels).

SYMBOL TUNING IMPROVES ALGORITHMIC REASONING
Symbol tuning is designed to force the model to learn from input-label mappings in the in-context exemplars because the symbols are unrelated to the task and no instructions are provided (and thus the model cannot rely on any other guidance to determine the task). For this reason, we posit that symbol tuning should not only improve the model's ability to map natural language inputs to arbitrary symbols, but also its ability to learn other forms of input-label mappings such as algorithms.
To test this, we experiment on algorithmic reasoning tasks from BIG-Bench (Srivastava et al., 2022). We first experiment on a set of list function tasks (Rule, 2020; Srivastava et al., 2022) where the model needs to identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers. These tasks were evaluated in a four-shot setting, following our evaluation setup in Section 3.2. Additionally, we test models on a set of simple turing concepts (Telle et al., 2019; Srivastava et al., 2022) where models need to reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string). These tasks have predetermined shots for each evaluation example. We selected these algorithmic tasks because they test the model's ability to generalize to different task types (the symbol-tuning tasks were classification problems with discrete labels, while these tasks are more open-ended generation problems) and do not require world knowledge (symbol tuning does not increase prior knowledge).
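As a toy illustration of the list function setup (the formatting and helper below are ours, not the benchmark's exact prompt format), the model must infer the transformation, here "remove the last element," purely from four input-output list pairs:

```python
def remove_last(xs):
    """The hidden transformation the model must infer from the shots."""
    return xs[:-1]

# Four in-context shots showing input -> output list pairs.
shots = [[3, 1, 4, 1], [5, 9], [2, 6, 5, 3, 5], [8, 9, 7]]
prompt = "\n".join(f"{xs} -> {remove_last(xs)}" for xs in shots)

# The evaluation query: a model that has learned the mapping
# should complete the final line with [1, 2].
query = [1, 2, 3]
prompt += f"\n{query} ->"
```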
In Figure 5, we show model performance on the twenty list function tasks with the highest human accuracy baselines 2 (Rule, 2020), separated into five categories (category details are described in Appendix D.1), and on the turing concepts containing 3 or fewer instructions in the AS II subset of the simple turing concepts task. On the list function tasks, symbol tuning results in an average performance improvement across all tasks of 18.2% for Flan-PaLM-8B, 11.1% for Flan-PaLM-62B, 15.5% for Flan-cont-PaLM-62B, and 3.6% for Flan-PaLM-540B. On the turing concept tasks, symbol tuning results in a performance improvement of 15.3% for Flan-PaLM-8B and Flan-PaLM-62B, 14.1% for Flan-cont-PaLM-62B, and 4.7% for Flan-PaLM-540B. Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks (in terms of average accuracy across tasks), which amounts to a ∼10x reduction in inference compute. These improvements on an unseen task type suggest that symbol tuning indeed strengthens the model's ability to learn in-context, as the symbol-tuning procedure did not include any algorithmic data and only used natural language data.

Figure 5: Performance on the list function tasks (Rule, 2020; Srivastava et al., 2022) and (F) the simple turing concepts task (Telle et al., 2019; Srivastava et al., 2022). Accuracy per list function category is averaged across all subtasks (categories and per-task results are shown in Appendix D.1).


SYMBOL-TUNED MODELS CAN OVERRIDE PRIORS VIA FLIPPED LABELS
Prior work showed that while pretrained language models (without instruction tuning) could, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability. Symbol tuning, on the other hand, forces models to treat the label presented in-context as an arbitrary symbol, which should reduce the model's usage of prior knowledge that contradicts the flipped labels. For this reason, we expect that symbol tuning would improve and restore the ability to follow flipped labels in-context.
To test this, we flip the labels of both in-context exemplars and the evaluation example for the tasks described in Section 3.2 (we remove tasks with more than two labels from this experiment since it is unclear how to best "flip" more than two labels). For example, for the SST2 dataset, all exemplars that are labeled as having "positive" sentiment will now be labeled as having "negative" sentiment. A perfect model that can follow these flipped labels should achieve 100% accuracy on these tasks if its accuracy on the standard in-context learning setting is also 100%.
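The flipping step can be sketched as follows (a minimal sketch with our own helper name and toy data; the actual experiments flip the labels of full evaluation datasets):

```python
def flip_labels(examples, label_a="positive", label_b="negative"):
    """Swap the two labels of a binary task in every (text, label) pair.

    Applied to both the in-context exemplars and the evaluation example,
    so a model that follows the flipped mapping scores above random.
    """
    flip = {label_a: label_b, label_b: label_a}
    return [(text, flip[label]) for text, label in examples]

exemplars = [("great movie", "positive"), ("terrible plot", "negative")]
flipped = flip_labels(exemplars)
```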
As shown in Figure 6, symbol tuning restores the ability to follow flipped labels that was lost during instruction tuning. We see that there is a similar trend across all model sizes-instruction-tuned models are generally unable to follow flipped labels (as demonstrated by their performance being far below random guessing), but symbol-tuned models are much more capable of doing so. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. For some datasets (e.g., OR, SUBJ, TC), symbol-tuned models can now override priors and follow flipped labels (i.e., achieve much better performance than random guessing), despite instruction-tuned models not being able to do so for any datasets. Additionally, symbol-tuned models achieve similar or better average performance as pretraining-only models, indicating that symbol tuning has, to some extent, restored the model's original ability to follow flipped labels.
These results further indicate another type of generalized in-context learning capability, as we did not include any flipped labels during symbol tuning. Although the performance improvement from symbol tuning is large, we note that more work should be done in this area, since performance in the flipped-label settings is, on average, not significantly better than random guessing.

Figure 6: Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are, for all model sizes. Instruction-tuned models cannot flip predictions to follow flipped labels (performance is well below random guessing), while symbol-tuned models can do this more often (performance matches or is slightly above random guessing). Ground-truth labels for evaluation examples are flipped, so if a model learns to follow flipped labels, its accuracy should be above random guessing (e.g., a perfectly-accurate model that can follow flipped labels should get 100% accuracy on our evaluations).

NUMBER OF TUNING STEPS
A question that may come to mind is how many steps of finetuning are needed to get the benefits of symbol tuning. In particular, Chung et al. (2022) performed instruction tuning for 40k steps for PaLM-8B and PaLM-62B, 21k steps for PaLM-540B, and 60k steps for cont-PaLM-62B, so it is unclear if symbol tuning would require such extensive tuning. Intuitively, however, since our symbol-tuning dataset is much smaller than the tuning data from Chung et al. (2022), symbol tuning should require fewer steps of finetuning than instruction tuning does. To analyze this, we examine model performance in each of the four ICL settings from Figure 4 with respect to the number of steps tuned. We train 8B and 62B models for up to 10k steps and 540B models for up to 5k steps, and we evaluate checkpoints every 1k steps on the same evaluation tasks and settings from Section 4.
We show these results in Figure 7. As expected, we see that symbol tuning does not require many steps of finetuning for any model. Moreover, the largest changes in performance occur within the first 1k to 2k steps of symbol tuning, after which model performance stays relatively constant. Flan-PaLM-540B also seems to experience performance drops in all settings after 1k steps, which may indicate that larger models require a more-diverse or larger set of symbol-tuning data. These results suggest that symbol tuning does not require extensive compute for exhaustive tuning.

Figure 7: Performance on the ICL settings from Figure 4 with respect to the number of steps tuned. For many models, the most-significant changes in performance emerge after tuning for 1,000 to 2,000 steps, indicating that symbol tuning does not require large amounts of compute to be effective. Performance is shown as the average accuracy across eleven datasets.

MIXING INSTRUCTION-TUNING DATA
In Section 4, we found that small models may actually overfit to the symbol-tuning data, resulting in performance drops in ICL settings where relevant labels are available. One potential way of preventing this is to include instruction-tuning data during symbol tuning. Since instruction-tuning examples contain relevant labels and instructions that match a model's prior knowledge, they may help reinforce prior knowledge and prevent small models from "forgetting" their priors. We create several mixtures of instruction-tuning data and symbol-tuning data to test this idea. For each mixture, we use varying ratios of instruction-tuning data to symbol-tuning data (e.g., a mixture with 33.3% symbol-tuning data means that instruction-tuning data is weighted twice as heavily as symbol-tuning data). Our instruction-tuning data is directly taken from Chung et al. (2022) and then mixed with our symbol-tuning data from Section 3.1.
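The weighting described above can be sketched as follows (the sampling helper is our own illustration, not the paper's pipeline): with a symbol-tuning fraction of 1/3, instruction-tuning examples are drawn twice as often as symbol-tuning examples.

```python
import random

def sample_mixture(symbol_data, instruction_data, symbol_fraction, n, rng):
    """Draw n training examples, taking each from the symbol-tuning pool
    with probability symbol_fraction and otherwise from the instruction pool."""
    batch = []
    for _ in range(n):
        source = symbol_data if rng.random() < symbol_fraction else instruction_data
        batch.append(rng.choice(source))
    return batch

rng = random.Random(0)
batch = sample_mixture(["sym"], ["inst"], symbol_fraction=1 / 3, n=9000, rng=rng)
```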
We then tune models on these mixtures and evaluate their performance. 3 In Figure 8, we show model performance on the ICL settings from Section 4. We find that even a small proportion of symbol-tuning data (e.g., 16%) relative to instruction-tuning data can significantly change model performance. Furthermore, higher proportions of symbol-tuning data beyond this initial change generally do not significantly affect model performance. 4 These results indicate that, in terms of a model's ability to succeed in these ICL settings, the proportion of symbol-tuning data used is not important as long as some non-trivial amount of symbol-tuning data is used. As shown in Figure 9, however, the proportion of symbol-tuning data is much more impactful for succeeding in flipped-label settings. We find that there is a strong correlation between a higher mixture of symbol-tuning data and a model's ability to follow flipped labels, a trend that holds regardless of the size of the model.

Figure 8: Performance on the in-context learning settings from Figure 4 with respect to the percentage of the tuning-data mixture that is symbol-tuning data (the rest of the mixture is instruction-tuning data). Tuning mixtures comprise instruction-tuning data from Chung et al. (2022) and symbol-tuning data (ours). For all models, only a small amount of symbol-tuning data is needed to improve model performance in many settings. Performance is shown as the average accuracy across eleven datasets.

Figure 9: Tuning models using mixtures with a higher proportion of symbol-tuning data results in better performance in the flipped-label setting. Performance is shown as the average accuracy across the six datasets from Section 6.
Combining this result with the trend shown in Figure 9, we propose using only symbol-tuning data as a default setting because it does not significantly decrease model performance (for large-enough models) and because a higher percentage of symbol-tuning data significantly improves the model's ability to override prior knowledge with in-context exemplars.

NUMBER OF TUNING DATASETS
The overall goal of symbol tuning is to teach models that any arbitrary label for an input-label mapping should be treated as a symbol to be learned. The symbol-tuning procedure should thus only be successful if a diverse-enough set of tasks is shown such that the model can learn to generalize its behavior to new tasks. To test this, we randomly remove a varying number of tasks from the mixture and retune models on these new mixtures. 5 We then evaluate these models on the ICL settings from Section 4.
We show these results in Figure 10. First, we see that, as a general trend, using more datasets for symbol tuning improves performance. This effect seems to plateau slightly as more datasets are added, and the 62B models benefit more from added datasets than the 8B model does. Second, we find that symbol tuning with a small number of datasets (e.g., only one or two datasets) can hurt performance in settings where relevant labels are available. For example, while symbol tuning using just one dataset can significantly improve performance in settings without relevant labels, it simultaneously decreases model performance in settings where relevant labels are available. These results imply that symbol tuning works best when a large variety of tasks is used, and symbol tuning with only a small number of tasks may result in models that perform worse in settings with relevant labels. Given these results, we note that future work may be needed to investigate the effects of scaling up the symbol-tuning procedure.

Figure 10: Models perform better when the symbol-tuning mixture includes more datasets; symbol tuning with fewer datasets can produce models that perform well in ICL settings without relevant labels but worse in ICL settings with relevant labels. All models are tuned for 4k steps. Zero datasets represents Flan-PaLM model performance without any symbol tuning. Performance is shown as the average accuracy across eleven datasets.

IN-CONTEXT LEARNING VIA SEMANTIC PRIOR KNOWLEDGE
Recent studies on in-context learning suggest that prior knowledge plays a significant role in how models learn in-context. For example, prior work showed that some small models and instruction-tuned models cannot follow flipped labels presented in-context, suggesting that these models primarily utilize prior knowledge for in-context learning. Min et al. (2022b) similarly found that using random ground-truth labels in in-context exemplars does not significantly affect performance, meaning that performance may be driven by other factors such as the label space. Reynolds & McDonell (2021) also showed that cleverly-constructed prompts in a zero-shot setting could outperform prompts in a few-shot setting, implying that, for some tasks, models can achieve better performance by leveraging their existing knowledge than by attempting to learn the task from in-context exemplars. Additionally, in chain-of-thought prompting (Wei et al., 2022b), Madaan & Yazdanbakhsh (2022) showed that performance on multi-step reasoning tasks does not decrease when models are provided with logically-incorrect prompts. Raghu et al. (2020) also demonstrated that systems such as MAML can effectively "memorize" labels when trained in a way where all labels can be memorized, which further illustrates that, when possible, models may attempt to use prior knowledge rather than adapt to each new task.
Our findings do not dispute the idea that semantic prior knowledge can provide significant benefits to in-context learning. Indeed, we showed that instruction-tuned models cannot follow flipped labels in-context, which is consistent with prior findings. We instead aim to demonstrate that through symbol tuning, language models can retain the benefits of utilizing prior knowledge while also improving their ability to learn from the input-label pairs shown in the in-context exemplars.

IN-CONTEXT LEARNING VIA IN-CONTEXT EXEMPLARS
At the same time, however, other recent work has suggested that language models can, in fact, learn in-context using the given exemplars. This ability may be more useful than the ability to use semantic prior knowledge because it would allow models to perform tasks that are not seen in, or even contradict, their pretraining data. Garg et al. (2022), for instance, showed that transformers trained from scratch can perform in-context learning on linear-regression tasks at a performance level similar to that of the least-squares estimator. This capability was shown to result from transformers implementing standard learning algorithms such as gradient descent (Akyürek et al., 2023; von Oswald et al., 2022; Dai et al., 2023). Furthermore, Webson & Pavlick (2022) demonstrated that, in a natural language setting, language models can learn at the same rate during finetuning even when given irrelevant or misleading prompts. On a broader level, Rajendran et al. (2020) and Yin et al. (2020) found that adding noise to, shuffling, or regularizing the label space can make systems better at learning and adapting to new tasks. In this paper, we attempt to improve the degree to which language models are able to learn tasks via input-label mappings. Our symbol-tuning method can be seen as a form of label augmentation and is thus similar to the methods proposed by Rajendran et al. (2020) and Yin et al. (2020), though it differs crucially in that we apply it to tune large language models. We found that symbol-tuned models saw significant improvements in their ability to learn in-context (e.g., on algorithmic tasks or settings with underspecified prompts).

TUNING LANGUAGE MODELS
Our work presented symbol tuning, a form of finetuning on input-label pairs where labels are remapped to arbitrary symbols. Symbol tuning relates to a broader body of work showing that finetuning language models can significantly alter their behavior and performance in different settings. For example, Wei et al. (2022a) first presented instruction tuning (finetuning on tasks phrased as instructions) and showed that this finetuning procedure substantially improves model performance in zero-shot settings. Chung et al. (2022) further scaled this procedure by adding more tasks, increasing model sizes, and adding chain-of-thought data, demonstrating that, with these changes, tuned models are significantly better at chain-of-thought reasoning, open-ended generation, and several evaluation benchmarks. Our experimental findings match these results, though our work differs by not only focusing on settings with in-context exemplars and underspecified prompts, but also by modifying the tuning procedure to make tasks harder to learn and require additional reasoning with exemplars.

CONCLUSIONS
In this paper, we presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when models cannot use instructions or relevant labels to determine a presented task, they must instead do so by learning from in-context exemplars. We tuned four language models (Flan-PaLM-8B, Flan-PaLM-62B, Flan-cont-PaLM-62B, and Flan-PaLM-540B) using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30k arbitrary symbols as labels.
Experimentally, we showed that symbol tuning can significantly improve a model's ability to learn from in-context exemplars, not only in natural language settings but also on algorithmic tasks. First, we showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Moreover, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning. Finally, we demonstrated that symbol tuning does not require extensive compute or complex implementations in order to achieve these improvements.
Through symbol tuning, we aim to increase the degree to which models can examine and learn from input-label mappings during in-context learning. We hope that our results encourage further work towards improving language models' ability to reason over symbols presented in-context. One unanswered question that arises is whether our results come from the symbol-tuning data itself or simply from the additional steps of tuning. To answer this question, we continue tuning Flan-PaLM models using the same instruction-tuning mixture from Chung et al. (2022) for the same number of steps for which the models were symbol tuned (see Appendix C.3). We then compare these instruction-tuned models with our symbol-tuned models on each reasoning task from Section 5, the flipped-label setting from Section 6, and the ICL settings from Section 4 in Table 2. We find that our symbol-tuned models significantly outperform the models with continued instruction tuning on each of these evaluations. These results suggest that, indeed, the performance improvements on these tasks were not a result of simply tuning the model for more steps. Instead, we conclude that the symbol-tuning data itself is the root cause of the results observed in this paper.

Table 2: Symbol-tuned models perform better than instruction-tuned models on the Turing concept and list function tasks from Section 5, the flipped-label setting from Section 6, and the ICL settings without relevant labels from Section 4. Performance change is calculated by subtracting the instruction-tuned model's performance from the symbol-tuned model's performance. Evaluation setups are the same for each task as they were in the respective section that introduced them; performance is shown as the accuracy (%) averaged across all subtasks. Per-task results for list function tasks from Section 5 are shown in Appendix D.1. Per-task results for ICL settings from Section 4 are shown in Appendix D.2.

A.2 DOES SYMBOL TUNING AFFECT PERFORMANCE ON BENCHMARKS?
As shown in Section 4, symbol-tuned models see only minor performance improvements in ICL settings with relevant labels, and small models (e.g., Flan-PaLM-8B) experience performance drops in these settings after symbol tuning. A natural question that follows is whether these differences on our unseen tasks translate to similar differences on well-studied benchmarks, as examples from these benchmarks often contain instructions and relevant labels. In particular, we examine model performance on the MMLU (Hendrycks et al., 2021) and BIG-Bench Hard (Suzgun et al., 2022) benchmarks. For this experiment, we set prompts in a 5-shot setting for MMLU and a 3-shot setting for BIG-Bench Hard, following the settings used in Chung et al. (2022).
In Figure 11, we show model performance on these benchmarks for each symbol-tuned model. We find that small models (i.e., Flan-PaLM-8B) may experience minor performance drops after symbol tuning. This aligns with the result shown in Section 4 and further bolsters the possibility that, after symbol tuning, small models may tend to use prior knowledge less and instead purely attempt to learn in-context. For larger models, on the other hand, symbol tuning only results in performance changes within approximately ±1%, indicating relatively consistent performance before and after symbol tuning. This consistent performance is expected, however, as symbol tuning is meant to improve a model's ability to learn from and reason with in-context exemplars, and models likely do not rely on in-context exemplars in order to succeed on these benchmarks.

A.3 CAN SYMBOL TUNING IMPROVE CHAIN-OF-THOUGHT REASONING?
One limitation of symbol tuning is that it does not include any data with chain-of-thought (CoT) reasoning (Wei et al., 2022b) since it is unclear how to best replace intermediate steps with symbols. We thus want to examine whether symbol tuning affects chain-of-thought reasoning given its ability to improve in-context learning. To analyze this, we reformat prompts from the two benchmarks in Appendix A.2 to use chain-of-thought prompting and evaluate all symbol-tuned models. We use the same chain-of-thought prompts that were used in Chung et al. (2022).
We show these results in Figure 12. We find that performance is mostly consistent between symbol-tuned models and their base variants when using CoT prompting. One outlier, however, is that Flan-PaLM-8B experienced a significant drop in CoT performance on BIG-Bench Hard after symbol tuning, though it is unclear why this occurred since it did not experience a drop in CoT performance on MMLU. Other than this outlier, the results are expected, as symbol tuning did not include any CoT prompts and thus should not change a model's performance in CoT settings.

A.4 DOES SYMBOL TUNING AFFECT ZERO-SHOT PERFORMANCE?
Our setup for symbol tuning does not include any zero-shot examples, as an arbitrary symbol that maps an input to a label cannot be learned without any exemplars. This raises the question of whether symbol tuning would harm a model's zero-shot performance, especially since we do not mix in any instruction-tuning data during symbol tuning for the reasons stated in Section 7.2. Intuitively, symbol tuning should not affect zero-shot performance because it should modify a model's ability to learn in-context and not its prior knowledge (which is what would primarily be used in zero-shot settings).
To investigate this, we evaluate the models on the MMLU benchmark (Hendrycks et al., 2021) with prompts reformatted to a zero-shot setting.

Figure 13: Performance on MMLU in a zero-shot setting does not significantly change after symbol tuning. Accuracy shown is an unweighted average over all tasks (per-task results are shown in Appendix D.5).
In Figure 13, we compare each symbol-tuned model's performance on zero-shot MMLU against its respective Flan-PaLM model. We find that performance is relatively consistent after symbol tuning. Symbol-tuned models saw a maximum decrease in performance of 1.7%, though this difference is not large enough to conclude that symbol tuning reduces zero-shot performance, given the variance within the evaluation. For example, continuing instruction tuning on Flan-PaLM-8B for 1k steps reduces MMLU 5-shot performance from 49.5% to 47.2%, and continuing for another 1k steps improves performance back to 49.0%. This may indicate that, for these benchmarks, small differences in performance are not enough to suggest an actual reduction or improvement in a model's true performance. For this reason, we posit that zero-shot performance before and after symbol tuning is relatively consistent for all base models, though we note that there is some ambiguity in this conclusion due to the variance in the performance metric.

A.5 DO SYMBOL-TUNED MODELS REQUIRE FEWER IN-CONTEXT EXEMPLARS?
In Section 4, we showed that symbol-tuned models perform much better than Flan-PaLM models in difficult ICL settings without relevant labels. Our evaluations, however, were all in a setting using four in-context exemplars per class, making it unclear how symbol-tuned models perform relative to baselines when there are fewer or more in-context exemplars that the model can use. Intuitively, symbol tuning should be more effective when there are fewer in-context exemplars available, as having fewer exemplars makes it more difficult to identify the task (and we already showed in Section 4 that symbol-tuned models are better in ICL settings where the task is unclear).
To investigate this, we regenerate evaluations using the same process as described in Section 3.2, except we vary the number of in-context exemplars per class. We then test models on the hardest ICL setting from Section 4 in order to study how instruction-tuned and symbol-tuned models behave relative to the number of available exemplars. These results are shown in Figure 14. We find that the performance difference between symbol-tuned models and their base variants is relatively consistent in all settings except when there is only one in-context exemplar per class. In this setting, symbol-tuned models perform much better than base models, and this trend is consistent across all of our tested models. We posit that this could be a result of the Flan-PaLM models not recognizing that arbitrary symbols are meant to be used as labels (which is implied by the fact that they perform significantly worse than random guessing), while symbol-tuned models have already learned that arbitrary symbols can be used as labels. These results suggest that in ICL settings where the task is unclear, symbol tuning improves model performance regardless of the number of in-context exemplars that are provided.

A.6 DOES SYMBOL TUNING REQUIRE USING ALL 30K LABELS?
As described in Section 3.1, our symbol-tuning procedure remapped original labels using a set of approximately 30k possible arbitrary symbols. This raises the question, however, of whether symbol tuning requires this large of a label space, and exactly how large of a label space is necessary for successful symbol tuning. Intuitively, we expect that models that are symbol tuned using larger label spaces should match or outperform those that are symbol tuned using smaller label spaces because a larger label space increases the diversity of the symbol-tuning data, which may make it easier to learn that any arbitrary symbol can be used as a label. We study how the size of the label space used for symbol tuning affects model performance by shrinking the label space for each category in Section 3.1. As our experiments from Section 3.1 use 10k possible labels per category, we decrease the label space size by only using 1k, 100, and 10 labels per category for possible labels.
We retune models and evaluate their performance on the ICL settings from Section 4, showing these results in Figure 15. We find that, in general, models perform slightly better after symbol tuning using larger label spaces, but that the performance improvement from using larger label spaces is greater for the smallest model, Flan-PaLM-8B. The improvement seen in Flan-PaLM-8B may suggest that the larger label space's ability to increase the diversity of the symbol-tuning data is important for smaller models that may have a harder time learning a general trend from a small sample size. Combined with the overall trend of improved performance with larger label spaces across model sizes and across ICL settings, we posit that using a larger label space can indeed improve symbol-tuned model performance to some degree, possibly because the larger label space creates a more diverse set of prompts for the model to learn from.

Figure 15: Symbol tuning using a larger label space slightly improves model performance, though the improvement is greater for the smallest model (Flan-PaLM-8B). All models are tuned for 4k steps. Performance is shown as the average accuracy across eleven datasets.

A.7 WHICH CATEGORY OF SYMBOLS IS MOST IMPORTANT DURING SYMBOL TUNING?
For our symbol-tuning procedure, we used symbols drawn from three categories (integers, combinations of characters, and words). Here, we investigate whether any particular category is more important for symbol tuning (one might expect, for example, that using labels that are more similar to natural language would better teach models to examine in-context exemplars before using prior knowledge, since models are more likely to have priors for those labels). We retune models (excluding Flan-PaLM-540B to reduce computational costs) using only integers, only character combinations, and only words as labels. In Table 3, we evaluate these models on the algorithmic reasoning tasks from Section 5, the flipped-label setting from Section 6, and the ICL settings from Section 4.
We find that for all model sizes, using only words as labels results in the best performance on flipped labels, indicating that this category best teaches models to examine in-context exemplars before using prior knowledge. Additionally, symbol tuning using words often yields the best performance when relevant labels are unavailable but, for Flan-PaLM-8B, yields the worst performance when relevant labels are available. This may suggest that small models learn to treat all natural language labels as arbitrary symbols, even when a label is relevant and could be utilized to better learn the task. Finally, while one might expect symbol tuning with numbers to be key to improving on algorithmic tasks, Flan-PaLM-8B and Flan-PaLM-62B actually perform better when tuned using only words (there is no consistently better label type for Flan-cont-PaLM-62B).

Table 3: Model performance on algorithmic reasoning and in-context learning tasks when symbol tuned using only integers, only character combinations, and only words as labels.
A.8 CAN SYMBOL TUNING BE SUCCESSFUL USING RANDOM LABELS?
As a sanity check, we want to show that symbol tuning cannot improve in-context learning when the tuning data is randomized. We expect this behavior since if the input-label mappings are randomized, there is no task to learn from the in-context exemplars and thus no reason to learn to use exemplars.
To show this, we use the same symbol-tuning procedure as before, but when remapping labels, we randomly select a symbol for each in-context exemplar rather than assigning a symbol to each label and consistently remapping all instances of that label to the new symbol. This ensures that the labels (despite being arbitrary symbols) are randomized and that there is no meaningful task to learn. We then retune models using symbol-tuning data generated using this modified process. In Figure 16, we show these models' performance on the ICL settings from Section 4. We find that the randomized symbol-tuning procedure is almost always worse than the standard symbol-tuning procedure. In settings without relevant targets, symbol tuning with randomized labels results in equal or worse performance compared with no symbol tuning at all, and model performance is strictly worse than that achieved by standard symbol tuning. In settings with relevant targets, while randomized symbol tuning results in worse performance than no symbol tuning, it outperforms standard symbol tuning for Flan-PaLM-8B, our smallest model. This result is not surprising, however, since in Section 4, we observed a large drop in model performance after symbol tuning for Flan-PaLM-8B in settings with relevant labels (which we posited resulted from the model treating all labels as arbitrary symbols, even when the label could have helped the model learn the task). Overall, these results indicate that, as expected, models do not learn to better utilize in-context exemplars when symbol tuned using exemplars with randomized labels.

B.1 SYMBOL-TUNING TASKS
Here, we show details of the tasks we used for symbol tuning as described in Section 3.1. We selected 22 publicly-available tasks from HuggingFace (Lhoest et al., 2021), ensuring that each task has discrete labels so that there would be labels to swap with our symbols.
For each dataset, we used examples from the training split, and because some datasets had more examples than other datasets by multiple orders of magnitude, we cap the number of examples taken from any singular dataset at 25,000. As shown in Table 4, our tuning dataset consists of 291,693 total unique examples.
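The per-dataset cap described above can be sketched as follows (an illustrative snippet; names and the toy data are ours):

```python
import random

def cap_datasets(datasets, cap=25_000, seed=0):
    """Cap the number of training examples drawn from any single dataset
    so that very large datasets do not dominate the tuning mixture."""
    rng = random.Random(seed)
    capped = {}
    for name, examples in datasets.items():
        if len(examples) > cap:
            capped[name] = rng.sample(examples, cap)
        else:
            capped[name] = list(examples)
    return capped

# Toy mixture: one oversized dataset and one small one.
datasets = {"big": list(range(60_000)), "small": list(range(3_000))}
mixture = cap_datasets(datasets)
assert len(mixture["big"]) == 25_000   # capped
assert len(mixture["small"]) == 3_000  # kept whole
```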
We selected datasets from several task types, including natural language inference.

B.2 EVALUATION TASKS
In this section, we list the eleven tasks from Section 3.2 that we used for our evaluation. We selected eleven publicly-available tasks from HuggingFace (Lhoest et al., 2021). In order to ensure that evaluation tasks were not seen during tuning, we select datasets that were not used in symbol tuning (Appendix B.1) and not used in instruction tuning (specifically, the datasets used in Chung et al. (2022)). As shown in Table 5, we use the following tasks: subjectivity detection (Conneau & Kiela, 2018, SUBJ), hate speech detection (Basile et al., 2019, TEH), abortion stance classification (Mohammad et al., 2016, TEAB), atheism stance classification (Mohammad et al., 2016, TEAT), feminism stance classification (Mohammad et al., 2016, TEFE), Hillary Clinton stance classification (Mohammad et al., 2016, TEHI), adverse drug event classification (Alex et al., 2021, ADEC), overruling classification (Alex et al., 2021, OR), organization classification (Alex et al., 2021, SOT), potentially-unfair terms-of-service detection (Alex et al., 2021, TOS), and Twitter complaint detection (Alex et al., 2021, TC). In Table 6, we also show the instructions that we provided for each dataset when instructions are included in the prompt setting.

Table 6: Instructions provided for each evaluation dataset.
SUBJ: "Is the following sentence subjective or objective?"
TEH: "Label the following tweet based on whether it contains hate speech."
TEAB: "Read the following tweet and determine its stance on abortion."
TEAT: "Read the following tweet and determine its stance on atheism."
TEFE: "Read the following tweet and determine its stance on feminism."
TEHI: "Read the following tweet and determine its stance on Hillary Clinton."
ADEC: "Label the following sentence based on whether it is related to an adverse drug event."
OR: "Label the following sentence based on whether it is overruling or not."
SOT: "Read the following paper title and institution name and classify the institution as a university, company, or research institute."
TOS: "Label the following sentence from a Terms of Service based on whether it is potentially unfair."
TC: "Label the following tweet text based on whether it contains a complaint."

C.1 ARBITRARY SYMBOLS
In this paper, we experimented using a set of ∼30k arbitrary symbols as shown in Figure 3. When selecting a symbol to replace natural language labels, we first randomly select a type of symbol from the three categories (integers, combinations of characters, and words) and then select a random symbol from the available symbols for that category. We did not test other ways of generating arbitrary symbols (e.g., picking random words from the prompt, combining multiple words, combining alphabetical characters and numbers, etc.) and leave this for future work.
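The two-stage selection (category first, then symbol) can be sketched as follows; the symbol pools here are tiny hypothetical stand-ins for the paper's roughly 10k-symbol categories:

```python
import random

# Hypothetical stand-ins for the three symbol categories; the paper's
# actual lists (~10k symbols per category) are not reproduced here.
SYMBOL_SPACE = {
    "integers": [str(n) for n in range(1000, 1100)],
    "characters": ["foo", "bar", "qux", "zmw", "kva"],
    "words": ["forests", "peoples", "certification", "womens"],
}

def sample_symbol(rng):
    """First pick a category uniformly, then a symbol within it."""
    category = rng.choice(sorted(SYMBOL_SPACE))
    return rng.choice(SYMBOL_SPACE[category])

rng = random.Random(0)
symbol = sample_symbol(rng)
# The sampled symbol always comes from one of the three pools.
assert any(symbol in pool for pool in SYMBOL_SPACE.values())
```

Note that sampling the category first keeps the three categories equally likely even when their pools differ in size.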

C.2 PROMPT FORMATTING
We used ten distinct prompt templates to format inputs and outputs into prompts. During both tuning and evaluation, prompts are randomly generated using one of these ten templates, in which [input] and [label] stand for the input and label of a given example, respectively (see Table 6 for the instructions that we used). Appendix E.2 contains examples of prompts that were generated using these prompt templates with instructions.
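A sketch of this formatting step, using illustrative templates of our own since the ten actual templates are not reproduced in this excerpt (the [input] and [label] fields correspond to the {input} and {label} slots below):

```python
import random

# Hypothetical templates standing in for the paper's ten.
TEMPLATES = [
    "Input: {input}\nOutput: {label}",
    "Input: {input}\nLabel: {label}",
    "{input} -> {label}",
]

def format_prompt(exemplars, query, rng):
    """Format in-context exemplars plus a final unlabeled query using one
    randomly chosen template, mirroring the tuning/evaluation setup."""
    template = rng.choice(TEMPLATES)
    shots = "\n\n".join(template.format(input=x, label=y) for x, y in exemplars)
    # The final example shows only the part of the template before {label},
    # leaving the label for the model to predict.
    head, _, _ = template.partition("{label}")
    return shots + "\n\n" + head.format(input=query).rstrip()

rng = random.Random(0)
prompt = format_prompt([("great movie", "foo"), ("dull plot", "bar")],
                       "loved it", rng)
assert "great movie" in prompt and "loved it" in prompt
```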

C.3 TUNING PROCEDURE
In Table 7, we show tuning details for each model that we symbol tuned. We primarily follow the hyperparameter selection from Chung et al. (2022); in particular, we use the same batch size, dropout, and learning rate for each model. On the other hand, we showed in Section 7.1 that symbol tuning does not require tuning for as long as instruction tuning does. Because we use packing (Raffel et al., 2020), the effective batch size is larger than the reported number.
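Packing concatenates several short examples into one fixed-length training sequence, which is why the effective batch size exceeds the nominal one. A minimal greedy first-fit sketch (our own illustration, not the exact Raffel et al. implementation):

```python
def pack_examples(token_lengths, max_len=2048):
    """Greedy first-fit packing: place each example in the first sequence
    with enough remaining room, opening a new sequence otherwise. Each
    batch row then holds several examples."""
    sequences = []   # total token count of each packed row
    assignment = []  # row index assigned to each example
    for length in token_lengths:
        for i, used in enumerate(sequences):
            if used + length <= max_len:
                sequences[i] = used + length
                assignment.append(i)
                break
        else:
            sequences.append(length)
            assignment.append(len(sequences) - 1)
    return sequences, assignment

lengths = [900, 700, 1500, 400, 300]
seqs, assign = pack_examples(lengths)
# 900+700+400 share row 0; 1500+300 share row 1: 5 examples in 2 rows.
assert seqs == [2000, 1800]
```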

D.1 BIG-BENCH LIST FUNCTIONS
We experimented on twenty list function tasks from the List Functions benchmark from BIG-Bench (Srivastava et al., 2022). These list function tasks were selected as the tasks with the highest human accuracy baseline reported in Rule (2020). We describe each of the tasks that we tested in Figure 5 and categorize them into five distinct categories based on the list function used by that task.
The pairings in all tasks are composed of input and output lists that contain numbers from 0 to 9 or numbers from 0 to 99 (these two ranges are separated such that a single list function can have two associated tasks, one for each range). Each task contains 32 input-output pairs; each pair is used as an evaluation example, and for each evaluation example, in-context exemplars are randomly selected from the remaining 31 pairs. In Section 5, we evaluated models on evaluation examples generated with four in-context exemplars. We show per-task results from this experiment for base models, continued instruction-tuned variants, and symbol-tuned variants in Table 8.
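The leave-one-out evaluation construction described above can be sketched as follows (a toy "+1" list function stands in for the actual BIG-Bench tasks):

```python
import random

def build_eval_examples(pairs, k=4, seed=0):
    """For each input-output pair, use it as the evaluation example and
    draw k in-context exemplars from the remaining pairs."""
    rng = random.Random(seed)
    evals = []
    for i, target in enumerate(pairs):
        pool = pairs[:i] + pairs[i + 1:]  # exclude the held-out pair
        exemplars = rng.sample(pool, k)
        evals.append((exemplars, target))
    return evals

# Toy task: the list function that adds 1 to every element.
pairs = [([n], [n + 1]) for n in range(32)]
evals = build_eval_examples(pairs)
assert len(evals) == 32
exemplars, target = evals[0]
assert len(exemplars) == 4 and target not in exemplars
```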

D.2 IN-CONTEXT LEARNING
We evaluated each model's in-context learning abilities on a set of eleven datasets as described in Section 3.2. We reported results on these tasks using an unweighted average of the per-task accuracies. In Table 9, Table 10, Table 11, and Table 12, we show base model, continued instruction-tuned model, and symbol-tuned model performance for each task. Models have been tuned with the same specifications described in Appendix C.3.

D.3 MMLU
We evaluate on MMLU (Hendrycks et al., 2021) in a five-shot setting where few-shot exemplars are from the "dev" set, following Chung et al. (2022). In this section, we report the "validation" set performance on MMLU for each task. We use the same prompts as Chung et al. (2022), which can be found at https://github.com/jasonwei20/flan-2. Prompts for STEM datasets are also the same as in Chung et al. (2022), which originated from Lewkowycz et al. (2022). We show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on MMLU in Table 13, Table 14, Table 15, Table 16, Table 17, and Table 18.

D.4 BIG-BENCH HARD
BIG-Bench Hard was created by Suzgun et al. (2022) by choosing tasks where model performance as recorded by Srivastava et al. (2022) was worse than the average human rater. There are a total of 23 tasks in BIG-Bench Hard; two of these tasks have three subtasks (Suzgun et al., 2022). Following Chung et al. (2022), we treat these subtasks as distinct tasks and take an unweighted average. Our prompts are the same as those used in Chung et al. (2022), which are also the same as the ones given in Suzgun et al. (2022). These prompts contain three in-context exemplars. We show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on BIG-Bench Hard in Table 19, Table 20, and Table 21.

D.5 MMLU (ZERO-SHOT)
In this section, we show full experimental results for Flan-PaLM models and symbol-tuned variants (after tuning for 4k steps for 8B and 62B models and 1k steps for 540B models) on MMLU (Hendrycks et al., 2021). These results are from evaluating models in a zero-shot setting rather than in the five-shot setting tested in Appendix D.3.

E.1 SYMBOL-TUNING PROMPTS
In this section, we provide an example of a full few-shot prompt for each of the 22 datasets used in the main paper. When generating these prompts, we follow the procedure described in Section 3.1. Namely, prompts use one of the ten possible formats shown in Appendix C.2 and contain 2-10 in-context exemplars per class. Original labels are remapped to arbitrary symbols as described in Section 3.1.

Prompt:
Input: A zoo worker is dead and two visitors are seriously injured after a Siberian tiger escaped from the San Francisco Zoo in San Francisco, California in the United States and attacked three people who were inside a cafe. The tiger was shot dead by police who were called to the scene. They found the tiger attacking one of the zoo visitors when they killed it.
A tiger attacked three people in San Francisco.

Output: 4348
Input: After the 1979 Soviet invasion and occupation, 3 million Afghans fled to Pakistan, which was encouraged by hefty Western aid to take them in.
Afghanistan was invaded by the Soviet Union in 1979.

Output: 4348
Input: In the May 2005 general election Michael Howard failed to unseat the Labour Government, although the Conservatives did gain 33 seats, playing the most significant role in reducing Labour's majority from 167 to 66.
In the May 2005 general election Conservatives got 33 seats.

Output: forests
Input: David Millar retained his Tour de France leader's yellow jersey despite crashing in the final two kilometres of the third stage of the race to Nantes.
Tour de France winner is likely to lose the crown.

Output: forests
Input: New Zealand's Qualifications Authority said Friday that it still strongly discourages students from using anything other than full English, but that credit will be given if the answer "clearly shows the required understanding," even if it contains text-speak.
Full English is recommended by New Zealand's Qualifications Authority.

Prompt:
Input: Even before they reached town, they could hear a sound like corn popping. Dora asked what it was, and Dad said it was firecrackers.
Dora asked what the sound was.
Label: MIC
Input: Alice tried frantically to stop her daughter from barking at the party, leaving us to wonder why she was behaving so strangely.
Alice's daughter was behaving so strangely.
Label: MIC
Input: The sack of potatoes had been placed above the bag of flour, so it had to be moved first.
The bag of flour had to be moved first.

Label: certification
Input: Jim signaled the barman and gestured toward his empty glass.
Jim signaled the barman and gestured toward Jim's empty glass. Lucy had studied so hard.

Label: certification
Input: George got free tickets to the play, but he gave them to Eric, because he was not particularly eager to see it.
George was not particularly eager to see it.
Label:

X = Who were the rioters?
In Kazakhstan on June 19, 1989, young men carrying guns, firebombs, iron bars and stones rioted in Zhanaozen, causing a number of deaths.
Y = JMH
X = What status did the Marshall Islands have in Germany?
It has been speculated that the crisis over the Carolines with Spain, which almost provoked a war, was in fact "a feint to cover the acquisition of the Marshall Islands", which went almost unnoticed at the time, despite the islands being the largest source of copra in Micronesia.

Overview. This prompt contains k = 7 in-context exemplars per class. The original natural language labels ["entailment", "neutral", "contradiction"] have been remapped to ["root", "KVA", "peoples"], respectively.

Prompt:
Input: We were unable to tour any of the facilities.
Output: peoples

Input: Woodland floors are blanketed with swathes of bluebells, and Gowbarrow Park, immortalized by Wordsworth, has its host of golden daffodils.
Gowbarrow Park is known for its lack of daffodils.
Output: peoples

Input: The northernmost village in the National Park and once a mining town, Caleeck, with its pastel cottages on either side of Chalk Beck, is now rather sleepy.
Caleeck was once a popular tourist spot with its pastel cottages.
Output: KVA

Input: In its fiscal year 2000 performance report, the Veterans Administration reported that performance declined with respect to its rating-related claims-processing timeliness and national accuracy rate.
In the fiscal year 2000 report, the VA said performance went down and fewer people were served.
Output: KVA

Input: A final factor affecting the environment is the agency's relationship with the Congress and central oversight agencies such as OMB.
Agency's relationship with the Congress do not affect the environment.

Output: peoples
Input: Effects of ambient air pollution on nonelderly asthma hospital admissions in Seattle, Washington 1987-1994
In Seattle, the effects of pollution on asthma patients were measured.

None of the EPA rules could receive comments online.

Prompt:
Input: A young girl laughing.

Prompt:
Input: . . . pays tribute to heroes the way julia roberts hands out awards-with phony humility barely camouflaging grotesque narcissism .

Label: 3804
Input: an uninspired preachy and clichéd war film .

Label: 3804
Input: hawke draws out the best from his large cast in beautifully articulated portrayals that are subtle and so expressive they can sustain the poetic flights in burdette's dialogue .

Label: 4839
Input: by candidly detailing the politics involved in the creation of an extraordinary piece of music , [jones] calls our attention to the inherent conflict between commerce and creativity .
Label: 4839
Input: de niro may enjoy the same free ride from critics afforded to clint eastwood in the lazy bloodwork . but like bruce springsteen's gone-to-pot asbury park , new jersey , this sad-sack waste of a movie is a city of ruins .

Label: 3804
Input: zigzag might have been richer and more observant if it were less densely plotted .

Label: 3804
Input: the pianist is the film roman polanski may have been born to make .

Label: 4839
Input: after all the big build-up , the payoff for the audience , as well as the characters , is messy , murky , unsatisfying .

Label: 3804
Input: the movie is . . . very funny as you peek at it through the fingers in front of your eyes .

Label: 4839
Input: the entire cast is first-rate , especially sorvino .

Label: 4839
Input: saddled with an unwieldy cast of characters and angles , but the payoff is powerful and revelatory .

Label: 4839
Input: this may be the first cartoon ever to look as if it were being shown on the projection television screen of a sports bar .

Label: 3804
Input: this pathetic junk is barely an hour long . nevertheless , it still seems endless .

Label: 3804
Input: the woodman seems to have directly influenced this girl-meets-girl love story , but even more reassuring is how its makers actually seem to understand what made allen's romantic comedies so pertinent and enduring .

Label: 4839
Input: awesome creatures , breathtaking scenery , and epic battle scenes add up to another 'spectacular spectacle . '

Label: 4839
Input: the angst-ridden , affluent slacker characters are more grating than engaging .

Label: 3804
Input: a compelling pre-wwii drama with vivid characters and a warm , moving message .

Label: 4839
Input: even those who would like to dismiss the film outright should find much to mull and debate .

Label: 4839
Input: a graceless , witless attempt at mating some like it hot with the wwii espionage thriller .

Prompt:
spot price on 14KT gold is $49.08 dwt in Tampa Bay today -crazy that gold is over $1800/ounceremember when the real price was $300.00.. -> 1527
"Bargain said that ""Iran was the 1st 2help us"". Is that means each take a piece or credit goes2 1country or another? @user @user -> 8517

Overview. This prompt contains k = 8 in-context exemplars per class. The original natural language labels ["duplicate", "not duplicate"] have been remapped to ["womens", "NDY"], respectively.

Prompt:
Input: What is a just society?
Is the american society a bad society?

Overview. This prompt contains k = 8 in-context exemplars per class. The original natural language labels ["equivalent", "not equivalent"] have been remapped to ["AFM", "1352"], respectively.

Prompt:
Input: " It 's going to happen , " said Jim Santangelo , president of the Teamsters Joint Council 42 in El Monte .
" That really affects the companies , big time , " said Jim Santangelo , president of the Teamsters Joint Council 42 in El Monte .

Output: 1352
Input: Most other potential buyers are interested only in cherry-picking the most attractive assets .
Other potential suitors are not interested in acquiring only the music business .
Output: 1352
Input: Recall proponents claim to have turned in more than 1.6 million signatures .
Recall sponsors say they have submitted 1.6 million signatures .
Output: AFM
Input: Appellate courts across the country have issued differing rulings on the issue , allowing public displays of the Ten Commandments in some cases and banning them in others .
Lower courts have splintered on the issue , allowing depictions of the Ten Commandments in some instances and not in others .
Output: AFM
Input: Martha Stewart shares fell $ 2.03 , about 18 percent , to $ 9.17 and were the NYSE 's biggest percentage loser .
Its shares fell 4.6 percent , or $ 4.04 , to $ 83.38 and was the blue-chip Dow 's biggest percent loser .
Output: AFM
Input: A new variant of Blaster also appeared Wednesday and seemed to be spreading , according to antivirus companies .
The new variation of Blaster was identified Wednesday , according to antivirus company Sophos .
Output: 1352
Input: While robbery appeared to be the motive , the suspects drove off before taking anything .
While robbery appeared to be the motive , the suspects fled before they could take anything , he said .
Output: AFM
Input: Both NASA and Russian space officials said it posed no danger to the crew .
American and Russian space officials stressed there is no immediate danger to the crew or the operation of the orbiting outpost .
Output: 1352
Input: It was developed with consultation from more than 300 leaders in academia , industry , government and the public .
The plan , called The NIH Roadmap , was developed over 14 months with help from more than 300 consultants in industry and academia .
Output: 1352
Input: A picture of the doctor 's son holding the guitar appeared in the National Enquirer just two weeks after George died .
A photograph of the doctor 's son holding the guitar appeared in the National Enquirer two weeks after Harrison 's death .
Output: AFM
Input: The dollar was last at $ 1.1149 to the euro , close to its strongest level since April 30 .
The dollar pushed as high as $ 1.1115 to the euro in early trade , extending Tuesday 's one percent rally to hit its strongest level since April 30 .
Output: 1352
Input: Aspen Technology 's shares dropped 74 cents , or 23 percent , to close at $ 2.48 on the Nasdaq .
In afternoon trading , Aspen 's shares were off 89 cents or more than 27 percent at $ 2.33 per share .
Output: AFM
Input: Egyptologists cast doubt Tuesday on an expedition 's claim that it may have found the mummy of Queen Nefertiti , one of the best-known ancient Egyptians .
Egyptologists think they may have identified the long-sought mummy of Queen Nefertiti , one of the ancient world 's legendary beauties .
Output: 1352
Input: The moment of reckoning has arrived for this West African country founded by freed American slaves in the 19th century .
Taylor is now expected to leave the broken shell of a nation founded by freed American slaves in the 19th century .
Output: 1352
Input: Trade deals between manufacturers and grocery retailers or distributors have long been governed by complicated contracts that offer retailers discounts , money for advertising or payments for prominent shelf space .
Manufacturers and grocers or distributors have a long history of complicated contracts offering retailers discounts , money for advertising or payments for prominent shelf space .
Output: AFM
Input: Nigeria and other African oil producers are increasingly important in U.S. plans to lessen dependence on Middle Eastern suppliers for its energy security .
Nigeria and other African producers are increasingly important in the former Texas oilman 's plans to lessen dependence on Middle Eastern suppliers for energy security .
Output: AFM
Input: " Our own history should remind us that the union of democratic principle and practice is always a work in progress , " Rice said in reference to Iraq .
" Our own histories should remind us that the union of democratic principle and practice is always a work in progress , " she said .
Output:
Answer: AFM

E.1.12 PAWS
Overview. This prompt contains k = 4 in-context exemplars per class. The original natural language labels ["paraphrase", "not paraphrase"] have been remapped to ["constitution", "DDX"], respectively.

Prompt:
Vera Zvonareva won the title by beating Caroline Wozniacki in the final 6 -3 , 3 -6 , 6 -3 .

Prompt:
Input: I dabbed the floor with a paper towel.
I spilled juice on the floor.
The floor was permanently stained.
cause
Label: NFG
Input: The service at the restaurant was slow.
There were many empty tables.
The restaurant was crowded.
cause
Label: brother
Input: The woman repaired her faucet.
The faucet was leaky.
The faucet was turned off.
cause
Label: NFG
Input: The woman tolerated her friend's difficult behavior.
The woman knew her friend was going through a hard time.
The woman felt that her friend took advantage of her kindness.
He suffered emotional trauma.
He suffered brain damage.
cause
Label: brother
Input: The patient was dehydrated.
The nurse tested his reflexes.
The nurse gave him an IV.
effect
Label: brother
Input: The librarian located the book in the stacks.
I requested the librarian's help.
The book was checked out.
cause
Label: NFG
Input: I put ice cubes in the hot soup.
The soup cooled down.
The soup became thick.
effect
Label: NFG
Input: The scientist gained recognition.
She discovered a new species.
She received tenure at the university.
cause
Label: NFG
Input: I ran the ice cube under warm water.
The ice cube stuck to my fingers.
The ice cube vanished.
effect
Label: brother
Input: The water in the teapot started to boil.
The teapot cooled.
The teapot whistled.

Prompt:
Student: How do I make chocolate chip cookie dough?
In a bowl, mix butter, sugar, brown sugar, vanilla extract, and eggs. To this, add a mixture of flour, baking soda, and salt. Finally, add chocolate chips to the bowl and put the bowl into the oven.
In a bowl, mix butter, sugar, brown sugar, vanilla extract, and eggs. To this, add a mixture of flour, baking soda, and salt. Finally, add chocolate chips to the bowl and mix thoroughly with a wooden spoon.
Teacher: serial
Student: How do you remove a flat tire from a car?
Lift the car up on a car jack, using a tire iron loosen the bolts on the wheel until they are off, and pull the tire off.
Lift the car up on a car jack, using a screwdriver loosen the bolts on the wheel until they are off, and pull the tire off.
Teacher: lf
Student: To prevent chance of a oil boilover and subsequent fire during deep frying a turkey
The turkey MUST be completely defrozen.
The turkey MUST be completely frozen.
Put them on a baking sheet and put them in the freezer for 10 minutes at 375 degrees. They'll come out like new.
Put them on a baking sheet and put them in the oven for 10 minutes at 375 degrees. They'll come out like new.
Teacher: serial
Student: how ot make mashed potatoes with skin
Bring a pot of lightly salted water to a boil. Add peeled potatoes, and cook until tender, about 15 minutes. Drain potatoes, and transfer to a bowl. Add butter, and mash with a potato masher or electric mixer until potatoes are starting to become smooth. Add milk and sour cream, and mix to your desired texture.
Bring a pot of lightly salted water to a boil. Add unpeeled potatoes, and cook until tender, about 15 minutes. Drain potatoes, and transfer to a bowl. Add butter, and mash with a potato masher or electric mixer until potatoes are starting to become smooth. Add milk and sour cream, and mix to your desired texture.

Overview. This prompt contains k = 7 in-context exemplars per class. The original natural language labels ["world", "sports", "business", "science/technology"] have been remapped to ["KYX", "european", "pillow", "3863"], respectively.

Prompt:
Sentences: GM Europe to Cut 12,000 Jobs in Deal (AP) AP -General Motors Corp.'s European unit Thursday announced a deal that will allow the struggling automaker to cut up to 12,000 jobs; most of them in Germany, where it will offer generous incentives for employees to leave.

Overview. This prompt contains k = 3 in-context exemplars per class. The original natural language labels ["abbreviation", "entity", "description and abstract concept", "human being", "location", "numeric value"] have been remapped to ["publicity", "cry", "trains", "ya", "HNY", "substances"], respectively.

Overview. This prompt contains k = 4 in-context exemplars per class. The original natural language labels ["true", "false"] have been remapped to ["completion", "availability"], respectively.

Prompt:
X = He danced hypnotically while she beat the atabaque.
Reading beats watching television.

Here, we provide examples of a full evaluation prompt for each of the 11 datasets used in the main paper. For each dataset, we randomly selected one of the four ICL settings from Figure 4 to show an example from. Each prompt contains k = 4 in-context exemplars per class for simplicity. We follow the process in Section 3.2 for remapping original labels to arbitrary symbols for evaluation.
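The remapping procedure can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's actual code: the function names, the symbol pool, and the Input/Output template are assumptions made for the example.

```python
import random

def remap_labels(examples, symbols, seed=0):
    """Replace each natural language label with a distinct arbitrary
    symbol, as in the evaluation remapping described above.
    `symbols` is an assumed pool of candidate symbols (e.g. random
    words or digit strings); `examples` is a list of (text, label)."""
    rng = random.Random(seed)
    labels = sorted({label for _, label in examples})
    # Draw one distinct symbol per original label.
    chosen = rng.sample(symbols, len(labels))
    mapping = dict(zip(labels, chosen))
    return [(text, mapping[label]) for text, label in examples], mapping

def build_prompt(exemplars, query):
    """Format k-shot exemplars in the Input/Output style used in the
    prompts shown here, ending with the unanswered query."""
    parts = [f"Input: {x}\nOutput: {y}" for x, y in exemplars]
    parts.append(f"Input: {query}\nOutput:")
    return "\n".join(parts)
```

For instance, `remap_labels([("great movie", "positive"), ("dull plot", "negative")], ["foo", "bar", "XJQ", "2193"])` maps each sentiment label to one of the candidate symbols, after which `build_prompt` assembles the in-context prompt string.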

Prompt:
Question: Is the following sentence subjective or objective?
however , boey and wayne get closer and johnny ( who had broken up with samantha ) falls for his new secretart , the paranoid sabrina .
Answer: 69651
Question: Is the following sentence subjective or objective?
the film is almost eerily calm and refuses to take sides . but that lets its insights penetrate all the deeper .

Overview. This prompt contains no relevant labels and no instructions. The original natural language labels ["against", "none", "favor"] have been remapped to ["41098", "blob", "SVN"], respectively.

Overview. This prompt contains relevant labels and instructions. The original natural language labels are ["against", "none", "favor"].

Prompt:
Question: Read the following tweet and determine its stance on atheism.
#Necessity is the mother of #innovation. Our #Deen is complete -there is no need for innovation. #islam #SemST
Answer: against
Question: Read the following tweet and determine its stance on atheism.
That nagging doubt you keep having about god? Gods not testing u, it's your intellect trying to tell you your beliefs are bullshit. #SemST
Answer: favor
Question: Read the following tweet and determine its stance on atheism.
If current trend of constitution making continues, Indigenous peoples would more than Madhesis-Hachhethu #identity #inclusion #SemST
Answer: none
Question: Read the following tweet and determine its stance on atheism.
No matter what you are going thru, I assure you somewhere in the world somebody has it 10x worse. #bethankful #hope #SemST
Answer: none
Question: Read the following tweet and determine its stance on atheism.
The desire to reach for the stars is ambitious. The desire to reach hearts is wise. Maya Angelou #wisdom #truth #SemST
Answer: none
Question: Read the following tweet and determine its stance on atheism.
It's YOUR plan, I just gotta be #willing to work hard & Seek you! #SemST
Answer: against
Question: Read the following tweet and determine its stance on atheism.
Religious leader's pay depends on convincing you that their pile of superstitions is actually true. #freethinker #SemST
Answer: favor
Question: Read the following tweet and determine its stance on atheism.
It was a lot of fun hearing @user talk about his new book #SacredCows. @user #GodlessGala #SemST
Answer: favor
Question: Read the following tweet and determine its stance on atheism.
Meanwhile, whilst others attempted to distract me, I continued on with my quest. I succeeded. #AntiSatan #AntiBibles #SemST
Answer: against
Question: Read the following tweet and determine its stance on atheism.
God blessed you with 100 dollars? AIDS and Ebola are ripping through kids in Africa, but yo black ass can pay your cable bill. GAWD #SemST
Answer: none
Question: Read the following tweet and determine its stance on atheism.
Nothing could be more dangerous to the existence of this Republic than to introduce religion into politics -Robert Green Ingersoll #SemST
Answer: favor
Question: Read the following tweet and determine its stance on atheism.
Calling all Angel. The angels will sing for the innocent. May God bless you. #MaryJaneVeloso #SemST
Answer: against
Question: Read the following tweet and determine its stance on atheism.
Faithful God, we #pray that we may learn to trust the uncertainty & mystery of walking on water toward you #SemST
Answer:
Answer: against

E.2.5 TEFE
Overview. This prompt contains relevant labels but no instructions. The natural language labels are ["against", "none", "favor"].

Prompt:
Input: @user @user the library, quit attempting 2 hurt others just so they'll think the way u do #SemST

Overview. This prompt contains no relevant labels and no instructions. The original natural language labels ["adverse drug event", "not adverse drug event"] have been remapped to ["lagoon", "EMQZ"], respectively.

Prompt:
Input: In 1991 the patient were found to be seropositive for HCV antibodies as detected by the ELISA method and confirmed by the RIBA method.
Output: EMQZ
Input: These cases were considered unusual in light of the short delay of their onset after initiation of immunosuppressive therapy and their fulminant course: 3 of these patients died of PCP occurring during the first month of treatment with prednisone.
Output: lagoon
Input: CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence is low and the severity is relatively mild, with no or mild self-reported discomfort.
Output: lagoon
Input: This case report describes a 13-year-old male with diagnosis of autistic disorder and fetishistic behavior.
Output: EMQZ
Input: CONCLUSION: Pancreatic enzyme intolerance, although rare, would be a major problem in the management of patients with CF.
Output: EMQZ
Input: We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon.
Output: lagoon
Input: After the first oral dose of propranolol, syncope developed together with atrioventricular block.
Output: lagoon
Input: CT-scan disclosed right ethmoid sinusitis that spread to the orbit after surgery.
Output: EMQZ
Input: Early detection of these cases has practical importance since the identification and elimination of the causative drug is essential for therapy success.

Prompt:
Question: Label the following sentence based on whether it is overruling or not.
to the extent that paprskar v. state, supra, applied the general test of waiver of constitutional rights set forth in johnson v. zerbst, supra, it is no longer viable.
Answer: ELZJ
Question: Label the following sentence based on whether it is overruling or not.
see boles, 554 so.2d at 961 ([i]f the county and other persons are not bound, then the status of the road as public or private is subject to being litigated again, and the results of later litigation may be inconsistent with the results of the initial litigation.).
Answer: YVM
Question: Label the following sentence based on whether it is overruling or not.
transfer of property from a parent to a child is presumed to be a gift, and the presumption may only be overcome by clear and convincing evidence to the contrary.

Overview. This prompt contains relevant labels but no instructions. The natural language labels are ["potentially unfair", "not potentially unfair"].

Prompt:
Take any action that damages or adversely affects, or could damage or adversely affect the performance or proper functioning of the airbnb platform ; -> not potentially unfair
F. does not contain any unsolicited or unauthorised advertising, promotional material, "junk mail", "spam", "chain letters", "pyramid schemes" or any other form of solicitation ; and -> not potentially unfair
To the maximum extent permitted by law, we (together with our officers, directors, employees, representatives, affiliates, providers and third parties) do not accept any liability for (a) any inaccuracies or omissions in the content displayed on or via the skyscanner services and/or skyscanner platforms ; or (b) any act of god, accident, delay or any special, exemplary, punitive, indirect, incidental or consequential loss or damage of any kind (including, without limitation, lost profits or lost savings), whether based in contract, tort (including negligence), strict liability or otherwise, incurred by you arising out of or in connection with your access to, use of, or inability to access or use, the skyscanner services and/or skyscanner platforms or any content contained provided therein. -> potentially unfair
You will not solicit login information or access an account belonging to someone else. -> not potentially unfair
We may revise these terms from time to time. -> potentially unfair
Supercell may reject, refuse to post or delete any user content for any or no reason, including, but not limited to, user content that in the sole judgment of supercell violates these terms of service. -> potentially unfair
Except for any claim relating to your or our intellectual property (such as trademarks, trade dress, domain names, trade secrets, copyrights and patents) ("excluded disputes"), you and onavo agree to resolve through final and binding arbitration any claim between you and onavo, including its affiliates, officers, directors, employees and agents and its affiliates' officers, directors, employees and agents (whether or not such dispute also involves a third party), regarding any aspect of your relationship with us, including these terms, your use of any of onavo's services, your rights of privacy and/or publicity, or any contacts you may have with us, directly or indirectly, for any reason ("dispute"). -> potentially unfair
No oral or written information or advice given by the licensor or its authorized representative shall create a warranty. -> not potentially unfair
You acknowledge and agree that posting any such user content may result in immediate termination or suspension of your spotify account. ->
Answer: potentially unfair

E.2.11 TC
Overview. This prompt contains relevant labels and instructions. The original natural language labels are ["complaint", "no complaint"].

Prompt:
Question: Label the following tweet text based on whether it contains a complaint.
If I can't get my 3rd pair of @beatsbydre powerbeats to work today I'm doneski man. This is a slap in my balls. Your next @Bose @BoseService
Answer: complaint
Question: Label the following tweet text based on whether it contains a complaint.
@NortonSupport @NortonOnline What the hell is a dm 5-10 days to get money back bank account now overdrawn thanks guys
Answer: complaint
Question: Label the following tweet text based on whether it contains a complaint.
@DanielNewman I honestly would believe anything. People are...too much sometimes.
Answer: no complaint
Question: Label the following tweet text based on whether it contains a complaint.
@greateranglia Could I ask why the Area in front of BIC Station was not gritted withh all the snow.
Answer: complaint
Question: Label the following tweet text based on whether it contains a complaint.
@nvidiacc I own two gtx 460 in sli. I want to try windows 8 dev preview. Which driver should I use.
Can I use the windows 7 one.
Answer: no complaint
Question: Label the following tweet text based on whether it contains a complaint.
I'm earning points with #CricketRewards
Answer: no complaint
Question: Label the following tweet text based on whether it contains a complaint.
@NCIS_CBS
Answer: no complaint
Question: Label the following tweet text based on whether it contains a complaint.

Prompt:
Y = adverse drug event
X = Several hypersensitivity reactions to cloxacillin have been reported, although IgE-mediated allergic reactions to the drug are rare and there is little information about possible tolerance to other semisynthetic penicillins or cephalosporins in patients with cloxacillin allergy.
Y = not adverse drug event
X = As termination was not an option for the family, the patient was extensively counseled and treated with oral ganciclovir.
Y = adverse drug event
X = CONCLUSIONS: These results suggest that clozapine may cause TD; however, the prevalence is low and the severity is relatively mild, with no or mild self-reported discomfort.
Y = not adverse drug event
X = A case study is presented of a licensed practical nurse who developed persistent contact dermatitis.
Y = adverse drug event
X = After the first oral dose of propranolol, syncope developed together with atrioventricular block.
Y = not adverse drug event
X = We describe a patient who developed HUS after treatment with mitomycin C (total dose 144 mg/m2) due to a carcinoma of the ascending colon.
Y = not adverse drug event
X = We report a case of long lasting respiratory depression after intravenous administration of morphine to a 7 year old girl with haemolytic uraemic syndrome.

Prompt:
X = she claims he is distant and has shut her out .
Y = subjective
X = there 's no conversion effort , much of the writing is genuinely witty and both stars are appealing enough to probably have a good shot at a hollywood career , if they want one .
Y = objective
X = jonah was kind of like a mailman except his messages came straight from god .
Y = subjective
X = and even if everything goes according to steve 's " plan , " is it really enough ?
Y = subjective
X = when they join forces to track down the mastermind behind the death of cho cho 's master , it leads these unusual partners into uncovering a dangerous conspiracy which puts both of their lives in danger .
Y = subjective
X = the skills of a calculus major at m . i . t . are required to balance all the formulaic equations in the long-winded heist comedy who is cletis tout ?
Y = objective
X = so unique and stubborn and charismatic that you want it to be better and more successful than it is .
Y = objective
X = though talk in the film often turns to death , khatra 's enthusiasm and love of life keep the movie surprisingly upbeat .
Y = objective
X = sent from the city to investigate the murder of a teenage girl in a small alaska town , a police detective ( pacino ) accidentally shoots his own partner while trying to apprehend a suspect .

Prompt:
Input: @NortonSupport @NortonOnline What the hell is a dm 5-10 days to get money back bank account now overdrawn thanks guys

Input: We can each end this contract anytime we want.
Target: not potentially unfair
Input: You acknowledge and agree that posting any such user content may result in immediate termination or suspension of your spotify account.
Target: not potentially unfair
Input: attempt to probe, scan, or test the vulnerability of any academia.edu system or network or breach any security or authentication measures ;
Target: potentially unfair
Input: Supercell may reject, refuse to post or delete any user content for any or no reason, including, but not limited to, user content that in the sole judgment of supercell violates these terms of service.
Target: not potentially unfair
Input: We may revise these terms from time to time.
Target: not potentially unfair
Input: Such termination or suspension may be immediate and without notice.
Target: potentially unfair
Input: We believe that you own your data and preserving your access to such data is important.
Target: potentially unfair
Input: You may not use the services (other than certain commercial tools) to sell a product or service, increase traffic to your own website or a third-party website for commercial reasons, such as advertising sales, or otherwise undertake any endeavor aimed at deriving revenue.
Target: potentially unfair
Input: 2.4 you grant certain content licenses to other users by submitting your content to publicly accessible areas of the service.
Target:
Answer: potentially unfair
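Each evaluation prompt above ends with the gold (possibly remapped) label after the final query. A minimal scoring sketch follows, under the assumption that the model's first generated line is compared to that gold symbol by exact match; the helper name and the first-line convention are assumptions for illustration, not the paper's actual evaluation code.

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of predictions whose first generated line exactly
    matches the gold symbol, ignoring surrounding whitespace."""
    correct = 0
    for pred, gold in zip(predictions, golds):
        # Compare only the first line, since a model may keep
        # generating further exemplars after emitting its label.
        if pred.strip().split("\n")[0].strip() == gold:
            correct += 1
    return correct / len(golds)
```

For example, a prediction of `"AFM\nInput: next exemplar ..."` against gold `"AFM"` counts as correct, because only the first line is scored.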