Inverse scaling can become U-shaped

Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks. However, if we were to observe worse performance as a function of scale ("inverse scaling") on certain tasks, this would indicate that scaling can also encourage behaviors that are misaligned with human preferences. The Inverse Scaling Prize (McKenzie et al. 2022) identified eleven such inverse scaling tasks, evaluated on models of up to 280B parameters and up to 500 zettaFLOPs of training compute. This paper takes a closer look at these inverse scaling tasks. We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize. With this increased range of model sizes and training compute, only four out of the eleven tasks remain inverse scaling. Six out of the eleven tasks exhibit "U-shaped scaling", where performance decreases up to a certain size and then increases again up to the largest model evaluated (the one remaining task displays positive scaling). In addition, we find that 1-shot examples and chain-of-thought can help mitigate undesirable scaling patterns even further. U-shaped scaling suggests that the inverse scaling trend observed in McKenzie et al. (2022) may not continue to hold for larger models, which we attribute to the presence of distractor tasks that only sufficiently large models can avoid.


Introduction
Scaling up language models (LMs) has been shown to improve model performance on a wide range of downstream tasks and has been claimed to unlock emergent abilities (Kaplan et al., 2020; Brown et al., 2020; Srivastava et al., 2022; Wei et al., 2022a, i.a.). However, are there any tasks for which performance gets worse as models scale? Tasks that exhibit this property have been referred to as inverse scaling tasks (Lin et al., 2022), and such tasks can help reveal flaws in the models' training data or objectives (McKenzie et al., 2022a).
The Inverse Scaling Prize was created to identify such tasks for which larger LMs show increasingly undesirable behavior (McKenzie et al., 2023). Submissions were scored based on a range of criteria including inverse scaling strength, task importance, novelty/surprisingness, task coverage, reproducibility, and inverse scaling generality across different models. Eleven tasks were awarded Third Prizes, and the datasets for these tasks have been publicly released. The scaling curves for these eleven tasks (see Figure 2 and also Appendix C, Figure 6) were shown on a range of LMs with scales spanning several orders of magnitude in parameters, including Gopher (42M-280B; Rae et al., 2021), Chinchilla (400M-70B; Hoffmann et al., 2022), and an Anthropic internal model (13M-52B).
In this paper, we take a closer look at the scaling behaviors for these eleven tasks. First, we evaluate PaLM models of up to 540B parameters (Chowdhery et al., 2022), trained on about five times more compute than the models evaluated in the Inverse Scaling Prize submissions (see Table 1). Under this setup, we find that six out of the eleven tasks exhibit what we call U-shaped scaling: performance first decreases, and then increases again for larger models. With one task demonstrating positive scaling (monotonically increasing performance) with PaLM, this brings the number of inverse scaling tasks down to four with the additional scale provided in our experiments. This finding of U-shaped scaling is consistent with prior observations of U-shaped scaling on BIG-Bench tasks such as TruthfulQA (Lin et al., 2022), Persian Idioms, and Identify Math Theorems (Srivastava et al., 2022).

U-shaped scaling
Setup. We evaluate PaLM models on all eleven Inverse Scaling Prize tasks. We use the 8B, 62B, and 540B PaLM models from the original paper and also include a 1B model trained on 40B tokens, which amounts to 0.2 zettaFLOPs of compute. The parameter count of PaLM 540B is about twice that of the largest model evaluated in the Inverse Scaling Prize (Gopher 280B), and the amount of compute used is about five times as much: 2.5K zettaFLOPs versus the 560 zettaFLOPs of Chinchilla 70B. We follow the exact experimental setup from McKenzie et al. (2023), with the same prompts and scoring protocol, where all answer choices are scored and the option with the highest probability is chosen as the prediction.
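The scoring protocol described above can be sketched as follows. Here `lm_logprob` is a hypothetical stand-in for a call that returns the model's log-probability of a completion given a prompt; it is not part of the official release, and the toy scorer below is purely illustrative.

```python
def pick_answer(prompt, choices, lm_logprob):
    """Score every answer choice under the LM and return the most probable one.

    lm_logprob(prompt, completion) -> float is a hypothetical scoring call:
    the log-probability the model assigns to `completion` after `prompt`.
    """
    scores = {choice: lm_logprob(prompt, choice) for choice in choices}
    return max(scores, key=scores.get)

# Toy scorer for illustration only: prefers shorter completions.
toy_scorer = lambda prompt, completion: -len(completion)
print(pick_answer("Q: 2+2=?\nA:", [" 4", " 22"], toy_scorer))  # -> " 4"
```

Because every choice is scored and compared, this protocol never produces unparseable predictions, unlike the free-form generation used for CoT later in the paper.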
Results. The results for PaLM on the eleven tasks are shown in Figure 2, with the cross-task average highlighted in the first panel. We also plot results for other LMs as reported in McKenzie et al. (2022b) for comparison. In summary, only four out of eleven tasks remain inverse scaling once the PaLM 540B model is included. Six out of eleven tasks change from inverse scaling to U-shaped, and one task (Repetitive Algebra) shows positive scaling with PaLM. This broad observation of U-shaped scaling demonstrates the difficulty of extrapolating inverse scaling curves to larger models.

Potential explanation.
A natural question about the U-shaped scaling results is: why does performance decrease and then increase again? One speculative hypothesis is the following. Each Inverse Scaling Prize task can be decomposed into two tasks: (1) the true task and (2) a distractor task, where performing the distractor task well hurts performance on the true task. Small models cannot perform either task and perform around chance. Medium-sized models can perform the distractor task, which results in worse performance compared to smaller models. Large models are able to ignore the distractor task and perform the true task, which leads back to increased performance and potentially solving the task. We describe potential distractor tasks for each of the inverse scaling tasks in Appendix D, Table 3. Note that while it could be possible to measure model performance on the distractor task only, this would be an imperfect ablation, since the distractor task and true task could have not only a competing but also a joint effect on performance. We leave further exploration of why U-shaped scaling occurs to future work.

Mitigation for inverse scaling
We next explore possible mitigation strategies for inverse scaling. In Section 2, we hypothesized the primary cause of inverse scaling to be distractor tasks that mislead the models towards a solution different from the true task. In-context demonstrations of a problem/solution pair could then discourage the models from solving the distractor task, since the answer according to the true task diverges from the answer according to the distractor task. If such demonstrations are accompanied by explicit rationales, this could guide the models towards identifying the true task even more strongly. To this end, we explore whether 1-shot demonstrations and 1-shot demonstrations with chain-of-thought reasoning improve undesirable scaling patterns.

1-shot demonstrations make all inverse scaling tasks U-shaped or flat
To gauge the effect of demonstrations, we re-evaluate the PaLM models on all tasks with 1-shot prompts, using the 1-shot dataset from the official Inverse Scaling Prize data release. This official 1-shot dataset is created by pairing each example in the dataset with a randomly sampled, different example. These examples are then simply prepended to the default prompts (see Appendix C, Figure 6). We find that all four tasks that remained inverse scaling after including the 540B model shift to U-shaped or flat scaling when prompted with 1-shot demonstrations. Specifically, Pattern Matching Suppression, Into the Unknown, and Prompt Injection change to U-shaped scaling, and Redefine changes to flat scaling (see Figure 3). Furthermore, the performance of the largest model benefits from 1-shot prompting in all four tasks. These results show that even a single example of a problem/solution pair effectively steers the models towards solving the true task, especially for larger models.
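The construction of the 1-shot dataset, as described above, can be sketched as follows. The dict fields (`prompt`, `answer`) are an illustrative schema, not necessarily the layout of the official release:

```python
import random

def make_one_shot_dataset(examples, seed=0):
    """Pair each example with a randomly sampled *different* example and
    prepend it as a 1-shot demonstration (mirroring the construction
    described for the official 1-shot release; field names are illustrative).
    """
    rng = random.Random(seed)
    one_shot = []
    for i, ex in enumerate(examples):
        # Sample any index other than i so the demonstration differs
        # from the example being evaluated.
        j = rng.choice([k for k in range(len(examples)) if k != i])
        demo = examples[j]
        one_shot.append({
            "prompt": demo["prompt"] + demo["answer"] + "\n\n" + ex["prompt"],
            "answer": ex["answer"],
        })
    return one_shot
```

Note that because the demonstration is sampled independently per test example, two runs over the same task see different demonstrations for different examples; this is why the controlled 1-shot variant with a fixed demonstration is introduced in Appendix A.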
The tasks that were already U-shaped with unmodified prompts remain U-shaped. See Appendix A, Table 2 for full results on all tasks.

Chain-of-thought helps U-shaped scaling become positive scaling
While our 1-shot results are promising in that even a single demonstration helps shift the inverse scaling trend to U-shaped or flat scaling, for most tasks the performance of the largest model (540B) still fell behind or was not substantially better than that of the smallest model (1B). This pattern held for six out of the ten tasks that were U-shaped or flat with 1-shot prompting. We explore whether chain-of-thought (CoT) prompting can help in such scenarios, based on recent work showing that CoT can greatly improve performance on multi-step reasoning tasks (Wei et al., 2022b; Kojima et al., 2022; Suzgun et al., 2022, i.a.).
For the experiments in this section, we follow the protocol of Wei et al. (2022b) and follow-up work that includes intermediate reasoning steps in the in-context demonstrations. We continue to use a single example as in Section 3.1, but now the demonstrations are paired with step-by-step rationales. Because CoT prompting also requires the models to generate intermediate steps, we use free-form generation followed by exact string match to evaluate model performance. This requires one additional modification to the prompt to facilitate the postprocessing of the model generations: specifically, the model is prompted to output the final answer following the expression "So the answer is". Other than these changes, the instructions and the structure of the prompts are kept as close as possible to the 1-shot prompts used in Section 3.1. We construct CoT prompts for ten inverse scaling tasks, excluding Prompt Injection, which is evaluated on loss instead of classification accuracy. See Appendix C, Figure 7 for examples of CoT prompts.
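The exact-match evaluation of free-form CoT generations can be sketched as follows; the exact parsing details (e.g., stripping a trailing period) are an illustrative reconstruction, not the paper's released code:

```python
def extract_answer(generation, marker="So the answer is"):
    """Pull the final answer out of a free-form CoT generation.

    Returns the text after the last occurrence of `marker`, stripped of
    surrounding whitespace and a trailing period; None if the marker is absent.
    """
    if marker not in generation:
        return None
    tail = generation.rsplit(marker, 1)[1]
    return tail.strip().rstrip(".").strip()

def exact_match(generation, gold):
    """Score a generation as correct iff the extracted answer equals gold."""
    pred = extract_answer(generation)
    return pred is not None and pred == gold

gen = "The premise rules this out. So the answer is No."
print(exact_match(gen, "No"))  # -> True
```

One consequence of this evaluation, visible in the 0-shot CoT results of Appendix F, is that a model that never emits the marker scores zero regardless of its underlying ability.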
We show results for six tasks in Figure 4: the three classification tasks that were inverse scaling in PaLM (Into the Unknown, Pattern Matching Suppression, and Redefine) and all other U-shaped tasks where the 540B model performed worse than or only similarly to the 1B model even after 1-shot prompting (Negation QA, Modus Tollens, and Memo Trap). Overall, CoT substantially improves performance on these tasks, with the exception of Redefine, where there is a small gain only for the 540B model (~6 percentage points over 1-shot). The scaling curves change to positive for Into the Unknown, Pattern Matching Suppression, Redefine, and Negation QA, although for Redefine this is a byproduct of smaller models underperforming their 1-shot counterparts. For Memo Trap, we observe an inverted-U-shaped curve where performance drops slightly for the largest model; nevertheless, there are consistent performance gains via CoT in 8B+ models. For Modus Tollens, CoT-prompted models achieve almost perfect accuracy regardless of size (i.e., flat scaling but saturated performance). See Appendix A, Table 2 for full results.

Conclusions
This paper has two simple takeaways. First, inverse scaling can turn into U-shaped scaling when evaluated on models of sufficiently large scale, as demonstrated on six out of eleven Inverse Scaling Prize tasks. The prevalence of U-shaped scaling we identified in this paper shows that inverse scaling curves do not necessarily extrapolate to larger models. Second, demonstrations and rationales are effective for mitigating undesirable scaling patterns.
All inverse scaling tasks change to U-shaped or flat scaling when a single demonstration is provided as part of the prompt. With additional intermediate reasoning steps, many of the U-shaped tasks further shift to positive scaling, along with substantial performance gains throughout. Taken together, a combination of scaling and prompting techniques appears to be a viable method for mitigating inverse scaling. However, the prompting approaches we explored have limitations in that they require manual construction of demonstrations and reasoning steps tailored to individual tasks. The 0-shot CoT approach proposed by Kojima et al. (2022) is one method that does not require manual prompt construction, but as we show in the additional experiment in Appendix F, the effectiveness of this approach is limited for the inverse scaling tasks. This leaves open an interesting area of future research: developing novel solutions for inverse scaling that do not require explicit demonstrations.

Limitations
The prevalence of U-shaped scaling does not mean that the Inverse Scaling Prize tasks are solved. Even when U-shaped scaling is observed, it is often the case that the performance of the largest model is still close to or worse than that of the smallest model (e.g., Resisting Correction, Modus Tollens). For several tasks, the absolute performance of the models is poor, with the best model performing near chance (e.g., Negation QA) or much worse (Pattern Matching Suppression). While we discuss several mitigation strategies to guard against undesirable scaling behavior in this paper, these observations demonstrate the inherently challenging nature of the tasks, highlighting an opportunity for future research towards improving absolute performance on these tasks. Furthermore, the mitigation strategies explored in this paper require manual construction of demonstrations. While this is relatively low-effort, requiring only one demonstration per task, the example still has to be tailored to individual tasks. We expect future work to develop more generalizable mitigation strategies, possibly inspired by the causes of inverse scaling identified in McKenzie et al. (2023).

A Full results
The full results for all eleven Inverse Scaling Prize tasks reported in this paper are shown in Table 2. We used the exact dataset and protocol from McKenzie et al. (2023) for the main experiments (Section 2), and used the officially released 1-shot dataset for the 1-shot experiments (Section 3.1). These experiments are marked 1-shot (official). We additionally ran 1-shot experiments where we fixed the 1-shot demonstration to be the same as the CoT demonstration, minus the step-by-step rationale, marked 1-shot (controlled). This is because the official 1-shot dataset used a randomly sampled example from the dataset as the 1-shot demonstration, which varied across each example in the test set. Since our CoT experiments (Section 3.2) use a single manually written demonstration for every test example, the CoT results are more directly comparable to the controlled 1-shot experiments, where the demonstrations are fixed.

B Prior examples of U-shaped scaling
See Figure 5 for examples of U-shaped scaling reported in the literature.

C Prompts
Figure 6 shows the original prompts of the tasks from the Inverse Scaling Prize. Figure 7 shows examples of the CoT prompts we constructed, with the difference from the official 1-shot prompts highlighted in blue.

D Distractor tasks
A possible hypothesis for why U-shaped scaling emerges is as follows. U-shaped scaling tasks consist of a true task and a distractor task. Medium-sized models are good enough to perform the distractor task, which hurts performance compared to smaller models that can perform neither the distractor task nor the true task. Larger models can ignore the distractor task and perform the true task, which leads to increased performance again. We show a speculative decomposition of tasks into a true task and a distractor task in Table 3.

E Model scale: parameters, data, and compute
As shown in Table 4, we computed training FLOPs following the protocol of Brown et al. (2020). See also Figure 8 for the average performance of different LMs on the Inverse Scaling Prize tasks, viewed through the axes of compute and model size.
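The Brown et al. (2020) accounting approximates training compute as roughly 6 FLOPs per parameter per training token. A minimal sketch, assuming PaLM 540B's reported ~780B training tokens:

```python
def train_zettaflops(params, tokens):
    """Approximate training compute following Brown et al. (2020):
    ~6 FLOPs per parameter per training token, in zettaFLOPs (1e21 FLOPs)."""
    return 6 * params * tokens / 1e21

# PaLM 540B with ~780B training tokens (an assumption taken from the PaLM
# paper) lands at roughly 2.5K zettaFLOPs, matching the scale quoted in
# Section 2; the 1B model on 40B tokens comes out near 0.2 zettaFLOPs.
print(round(train_zettaflops(540e9, 780e9)))  # -> 2527
```

This is why compute and parameter count give different orderings of the models in Figure 8: compute also folds in the number of training tokens, which varies across model families.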

F 0-shot CoT experiments
We additionally investigate whether the 0-shot CoT approach proposed by Kojima et al. (2022) is effective against inverse scaling, given that this approach does not require task-specific prompt construction.
Following their method, we first append "Let's think step by step" to the original prompt. Then, we extract the rationale generated by the model and append it after "Let's think step by step". Finally, we append "So the answer is" at the end and prompt the model for the final answer. We run the 0-shot CoT experiments for 8B+ models only, given that 1B models generally show limited ability to perform CoT reasoning (this trend was also evident in our main CoT experiment). The results are shown in Table 2. The results are highly mixed but rarely beneficial: only two tasks clearly benefit from 0-shot CoT compared to the default setup (Pattern Matching Suppression, Modus Tollens). Two tasks show substantial gains only for the 8B model (Hindsight Neglect, Repetitive Algebra). The rest either remain similar (Negation QA, Into the Unknown) or show lower performance (Memo Trap, Redefine, Sig Figs, Resisting Correction). In tasks where 0-shot CoT leads to lower performance, we often observed that the models failed to produce any reasoning chain at all.
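The two-stage procedure above can be sketched as follows; `generate` is a hypothetical interface to the model (any callable returning a text continuation), and the toy model at the bottom exists only to make the sketch runnable:

```python
def zero_shot_cot(prompt, generate):
    """Two-stage 0-shot CoT in the style of Kojima et al. (2022).

    generate(text) -> str is a hypothetical call returning the model's
    continuation of `text`.
    """
    # Stage 1: elicit a free-form rationale.
    stage1 = prompt + " Let's think step by step."
    rationale = generate(stage1)
    # Stage 2: append the rationale, then prompt for the final answer.
    stage2 = stage1 + " " + rationale.strip() + " So the answer is"
    return generate(stage2).strip()

# Toy model for illustration: emits a canned rationale, then a canned answer.
def toy_generate(text):
    if text.endswith("So the answer is"):
        return " Yes."
    return "P implies Q; Q is false, so not P."

print(zero_shot_cot("Is modus tollens valid?", toy_generate))  # -> "Yes."
```

Because the final answer is read off after "So the answer is", a model that produces no rationale in stage 1 typically also fails stage 2, consistent with the failure mode observed above.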

Figure 1 :
Figure 1: Across ten tasks from the Inverse Scaling Prize (McKenzie et al., 2022a), PaLM (Chowdhery et al., 2022) on average exhibits U-shaped scaling, which means that performance first decreases and then increases again as the model gets larger. Model scale can be viewed through the axis of either compute (zettaFLOPs for pretraining) or model size (# of parameters); see Appendix E, Figure 8 for the model size plot. The y-axis denotes the average accuracy of the ten tasks that use accuracy as the metric, excluding Prompt Injection, which uses loss as the metric.

Figure 2 :
Figure 2: Scaling curves for the eleven Inverse Scaling Prize tasks. Prompt Injection (Injection) uses loss as the evaluation metric and is not included in the average. The only model that has been added in this paper is PaLM (Chowdhery et al., 2022). Results from other models are taken from McKenzie et al. (2022b).

Figure 3 :
Figure 3: Providing 1-shot demonstrations changes the four inverse scaling tasks in PaLM to U-shaped or flat scaling. The performance of the largest model benefits from 1-shot prompting in all cases.

Figure 4 :
Figure 4: Chain-of-thought (CoT) prompting generally improves performance in 8B+ models, and changes many U-shaped tasks into positive or flat scaling. To control for the effect of the choice of demonstration examples, we compare CoT against 1-shot experiments that use the same fixed demonstration example as our CoT prompts (minus the rationale), rather than comparing directly against results from Section 3.1, which were evaluated on the official dataset that uses a randomly sampled demonstration for each example (see also Appendix A).

Figure 7 :
Figure 7: Example 1-shot CoT demonstrations for the three classification tasks that are inverse scaling in PaLM. The demonstrations contain CoT reasoning and the expression "So the answer is" immediately before the final answer. These demonstrations are prepended to the default prompt containing the actual problem that the model has to solve (Figure 6). The blue highlights denote the difference between the 1-shot CoT prompts and the simple 1-shot prompts used in Section 3.1.

Figure 8 :
Figure 8: Across ten tasks from the Inverse Scaling Prize (McKenzie et al., 2022a), PaLM (Chowdhery et al., 2022) on average exhibits U-shaped scaling, which means that performance first decreases and then increases again as the model gets larger. Model scale can be viewed through the axis of either compute (zettaFLOPs for pretraining; left) or model size (# of parameters; right). The y-axis denotes the average accuracy of the ten tasks that use accuracy as the metric, excluding Prompt Injection, which uses loss as the metric. All results are obtained using the exact prompts and evaluation format specified by the Inverse Scaling Prize.