Training Trajectories of Language Models Across Scales

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al., 2022), from 125M to 175B parameters, on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model size, a similar subset of training tokens sees the most significant reduction in loss, with the rest stagnating or showing double-descent behavior (Nakkiran et al., 2020); 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; and 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.


Introduction
Scaling up language models has been shown to improve language modeling perplexity (Kaplan et al., 2020; Hernandez et al., 2022) as well as zero- or few-shot end-task accuracies (Brown et al., 2020; Rae et al., 2021; Chowdhery et al., 2022; Zhang et al., 2022). However, relatively little is understood about why or how this happens. How do the training dynamics differ as models get larger? What do language models of different sizes learn during pre-training, in terms of both generating texts and solving end tasks?
We attempt to make progress to answer these questions by studying the training trajectories of differently-sized OPT models (Zhang et al., 2022) through analyzing their intermediate checkpoints.
In contrast to prior work, which studies the trajectories of small models with up to 300M parameters (Liu et al., 2021; Choshen et al., 2022; Blevins et al., 2022) or focuses on the language modeling objective alone (Kaplan et al., 2020; Hernandez et al., 2021, 2022), we are the first to comprehensively study the training trajectories of large-scale autoregressive language models with up to 175B parameters across a wide range of settings. Repeatedly across training and different model scales, we analyze three aspects of model performance: (i) next-token prediction on subsets of tokens, (ii) sequence-level generation, and (iii) downstream task performance. We use perplexity, which is closely tied to language model evaluation, as the major metric throughout the study.
For next-token prediction ( §3), we study the trajectory by categorizing each token's prediction as stagnated, upward, or downward according to its perplexity trend as training progresses. We find that each category comprises a significant number of tokens: while many tokens' perplexity stagnates, a subset of tokens with an increasing perplexity in smaller models exhibits a double-descent trend (Nakkiran et al., 2020), where perplexity increases and then decreases in larger models. These behaviors primarily emerge at a similar validation perplexity across model scales.
For sequence-level generation ( §4), we study the distribution shift at a document level (50-500 tokens) by decoding sequences that one model (small or large) favors more than the other. Human texts present expected scaling patterns in that they are best modeled by larger (or longer-trained) models. However, to our surprise, large models are also better at modeling less human-like texts that contain synthetic noise or factually incorrect prompts. We propose an approach to decoding texts that small models favor more than large models, from an interpolated distribution induced by combining signals from both models, and find them grammatical but hallucinating. All models go through a stage during training where the perplexity of such texts decreases; small models halt at this suboptimal distribution, while larger models escape it by eventually increasing the perplexity of these unnatural texts.
We further connect language modeling perplexity to downstream tasks ( §5). By evaluating more than 70 multiple-choice tasks in BIG-Bench (Srivastava et al., 2022), we find that language modeling perplexity correlates well with few-shot in-context learning performance along the trajectory, regardless of model size. The gradual divergence of likelihood between correct and incorrect options leads to improvements in in-context learning.
Our work presents a comprehensive study of training trajectories of language models trained with similar procedures, e.g., OPT. We conclude that language models learn the same phenomena in the same order across different model sizes. The overall model perplexity is a composite measure of which language phenomena have been learned.

Experimental Settings
Models. Unless otherwise indicated, all of our experiments use OPT (Zhang et al., 2022), a collection of open-source autoregressive language models. OPT models serve as a good fit for this study due to their controlled pre-training procedures across all model sizes. In particular, all the models share the same tokenization and are trained on the same training data, covering a total of 300B tokens (180B unique). Note that different-sized models differ in batch sizes and total number of steps. 3 We collect intermediate checkpoints from the authors and perform evaluations of these checkpoints across six different sizes: 125M, 1.3B, 6.7B, 13B, 30B, and 175B.
Validation perplexity. Throughout this paper, we use Validation Perplexity (Valid PPL) to refer to the autoregressive language modeling perplexity measured on the entire validation set. We use the original OPT validation set, a held-out subset of the training corpus that covers a wide range of domains, such as books, news, and subtitles. We plot the trajectory of validation perplexity in Figure 1, which follows a similar power-law pattern observed in previous scaling work (Kaplan et al., 2020; Hoffmann et al., 2022).
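As a concrete reference for this metric, the sketch below derives corpus-level perplexity from per-token log-probabilities (the standard formulation: the exponentiated mean negative log-likelihood; the exact aggregation across OPT's validation subsets may differ in detail):

```python
import math

def validation_perplexity(token_logprobs):
    """Corpus-level perplexity: exp of the mean negative log-likelihood
    over all next-token predictions in the validation set."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Three next-token predictions with (natural-log) probabilities 1/2, 1/4, 1/8;
# their geometric mean is 1/4, so the perplexity is 4.
print(validation_perplexity([math.log(0.5), math.log(0.25), math.log(0.125)]))  # ≈ 4.0
```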
Methodology. We aim to understand how models of different sizes behave throughout training as a function of training compute (FLOPs) 4 and validation perplexity. Throughout the paper, we use different measurements to characterize model behavior and plot them against these two metrics.

Next-Token Prediction
Autoregressive language models are trained to predict the next token given a context. Figure 1 shows that validation perplexity, aggregated over all positions, gradually declines as training progresses. However, it is not clear whether all token instances evolve similarly to the aggregated measurement. In this section, we study the trajectory of next-token predictions, dividing them into three categories (stagnated, upward trend, or downward trend) to understand how language models gradually learn new language phenomena.

Methodology
We evaluate intermediate checkpoints on a subset of validation data. 5 For each context-token pair (c, t), we obtain a series of perplexities PPL_{m_1}(t | c), PPL_{m_2}(t | c), . . . , PPL_{m_n}(t | c) for checkpoints m_1, m_2, . . . , m_n. We use linear regression to estimate the slope of a normalized series to roughly capture its trend. Starting from any intermediate checkpoint after p% of training (assuming that it is the j-th checkpoint) to the end checkpoint m_n, we fit the following linear function to the normalized perplexity series to learn the parameters α and β:

    PPL_norm_{m_i}(t | c) = α + β · i,  ∀i ∈ [j, n],

where PPL_norm denotes the normalized perplexity series. Note that different starting points might result in different trend estimations. We categorize the trends as follows based on β and its significance:
Upward trend. If β > 0 and its p-value is < 0.05, we consider that the series follows an upward trend (forgetting).
Downward trend. If β < 0 and its p-value is < 0.05, we consider that the series follows a downward trend (still learning).
Stagnated. Otherwise, if the series has a small variance (measured by the standard deviation of log PPL_{m_i} over i), we consider the series to be stagnated (already learned).
We design the criteria to roughly capture the trend of the perplexity series of each next-token prediction. Under these criteria, a stagnated series from an earlier checkpoint would continue to stagnate, and a series that follows an upward or downward trend earlier might turn stagnated afterwards. The criteria do not necessarily cover all the series-wavy series with a large variance do not fall within any category and are eliminated. For the rest of the section, for simplicity, we use tokens to refer to context-token pairs.
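The criteria above can be sketched as follows. The zero-mean/unit-variance normalization, the thresholds, and the use of `scipy.stats.linregress` are our illustrative assumptions, not the exact implementation:

```python
import numpy as np
from scipy.stats import linregress

def categorize_trend(ppl_series, p_threshold=0.05, std_threshold=0.05):
    """Classify the perplexity trajectory of one context-token pair.

    ppl_series: log-perplexities at checkpoints m_j, ..., m_n (the window
    starting after p% of training). Returns 'upward' (forgetting),
    'downward' (still learning), 'stagnated' (already learned),
    or 'uncategorized' (wavy series with large variance).
    """
    y = np.asarray(ppl_series, dtype=float)
    # Normalize so that slopes are comparable across tokens (illustrative choice).
    y_norm = (y - y.mean()) / (y.std() + 1e-8)
    x = np.linspace(0.0, 1.0, len(y))
    fit = linregress(x, y_norm)  # fits y_norm ≈ alpha + beta * x
    if fit.pvalue < p_threshold:
        return "upward" if fit.slope > 0 else "downward"
    if y.std() < std_threshold:
        return "stagnated"
    return "uncategorized"
```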

Analysis
Percentage of tokens. We show the percentage of tokens that follow each trend in Figure 2. Overall, the percentage of stagnated tokens increases while the percentages of the other two types decrease, indicating that more tokens have been learned and fewer are still being learned or forgotten as training progresses.
Stagnated tokens. We select stagnated tokens starting from 10% of training for a particular model and analyze the trajectory of these same tokens in other models. As shown in Figure 3 (middle), we observe that tokens that stagnate after 10% of training in a small model (1.3B) also stagnate in larger models. However, the stagnated tokens selected by a large model (175B) still show a downward trend in smaller models. This suggests that larger models' stagnated tokens are roughly a superset of smaller models'. On manual inspection, stagnated tokens are primarily non-content words such as prepositions, determiners, and punctuation.
Upward trend tokens. Similarly, we present the perplexity of upward-trend tokens in Figure 4. The leftmost figure shows that such a phenomenon exists for all the models. 6 For tokens that present an upward trend after 10% of training of a small model (1.3B), we observe a stepwise double-descent trend (Nakkiran et al., 2020) in larger models' trajectories, where the perplexity first increases and then decreases. We are the first to observe this phenomenon during language model training, and it suggests that larger models, with more computation and a larger capacity, first overfit to this subset of tokens and then generalize better on them. For the tokens identified after 20% of training of the largest model (175B), the upward trend appears only at the end of training for the 13B and 30B models. We find it hard to characterize these tokens from their contexts, 7 but the synergy across model sizes strongly suggests that consistent types of learning are triggered at particular computation levels for models across scales. 8
6 Only around 60% of tokens are captured by our criteria; please find more details on the other tokens in Appendix B.2. 7 More details are in Appendix B.3.
Summary. In conclusion, large models first replicate small models' behavior on the same subset of tokens, and further unlock exclusive phenomena when fueled with more computation. In Appendix B.5, we find that the trajectories of differently-sized models largely overlap when plotted against validation perplexity, indicating that they make similar predictions at a similar perplexity. 9

Sequence-Level Generation
In this section, we extend the analysis from token-level predictions to entire sequences of 50-500 tokens. Larger language models consistently obtain a better perplexity when modeling human texts such as Wikipedia, with the perplexity decreasing as model size and training computation increase (Figure 1). Autoregressive language models are probabilistic models of sequences that can generate strings of text. If larger models assign a higher probability to virtually all human-authored texts, what sequences do smaller models favor? We aim to first characterize these sequences and then analyze learning behavior on them to understand how models of different sizes evolve into their final distributions. In what follows, we first show that it is difficult to manually design such sequences, as large models can also favor corrupted or factually incorrect texts ( §4.1). We then devise a decoding algorithm to automatically generate sequences favored by smaller models.

Manual Design
Corrupted datasets. We hypothesize that injecting noise into human texts might reverse the scaling trend (i.e., perplexity on corrupted texts might increase as model size increases). To test this hypothesis, we replace 20%, 40%, 60%, 80%, and 100% of the subwords in each sequence with random subwords. We evaluate the corrupted datasets on the final model checkpoints and report the perplexity in Figure 5 (left). Contrary to our hypothesis, the downward trends largely persist across all noise levels, even when the entire sequence consists of random tokens (100%). This can be explained by the copy-and-complete interpretation of in-context learning described in Olsson et al. (2022): larger models fare better at making predictions that follow the context distribution than smaller models, even when the context is pure noise. 10
Incorrect options of multiple-choice tasks. We next hypothesize that the perplexity of incorrect options for multiple-choice tasks might present an inverse scaling trend, as they are generally factually wrong. We present the perplexity of correct and incorrect options of 74 multiple-choice tasks from the BIG-Bench dataset in Figure 5. 11 However, we find that the perplexity of both correct and incorrect options decreases as the size of the model increases. 12 In summary, our initial attempt failed: we are not able to manually construct texts that are more probable under smaller models than larger models.

Methodology
To continue our search for such texts, we next devise a decoding approach that combines signals from two models and generates texts based on the interpolation of their distributions:

    p'(x_i | x_{<i}) = λ_1 p_s(x_i | x_{<i}) + λ_2 p_l(x_i | x_{<i}),  (2)

where p_s and p_l are the next-token distributions from the small and large models, respectively, and λ_1, λ_2 ∈ [−1, 1]. A pair of λ_1 and λ_2 denotes a specific configuration. When λ_1 = 0, λ_2 = 1, it is simply decoding with the large model; when λ_1 = 1, λ_2 = −1, the decoding process favors the small model's prediction and suppresses the large model's prediction. This is the configuration that decodes sequences on which small models have a lower perplexity than large models. We further remove tokens that have a negative score and renormalize the distribution p' to ensure that the probabilities of all tokens sum to 1:

    p''(x_i | x_{<i}) = max(p'(x_i | x_{<i}), 0) / Σ_{x∈V} max(p'(x | x_{<i}), 0).  (3)

Generation process. We decode sequences with two models, 125M and 30B, using different configurations of λ_1 and λ_2. We take the first 5 tokens of a subset of validation documents as prompts and generate 50 tokens conditioned on them. 13 We try greedy search and nucleus sampling (Holtzman et al., 2019) for decoding and evaluate the texts decoded from each configuration as follows: 1) we measure the text perplexity at the final checkpoints of differently-sized models to understand its scaling trend; 2) we measure the text perplexity at all intermediate checkpoints to understand how the perplexity evolves as training progresses.
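A minimal sketch of this decoding distribution over a toy vocabulary, assuming a plain NumPy probability vector per model (the fallback for an all-zero score vector is our own assumption, not specified in the paper):

```python
import numpy as np

def contrast_distribution(p_small, p_large, lam1=1.0, lam2=-1.0):
    """p' = lam1 * p_s + lam2 * p_l over the vocabulary; tokens with a
    negative score are removed and the rest renormalized.
    (lam1, lam2) = (1, -1) favors the small model's predictions while
    suppressing the large model's."""
    scores = lam1 * np.asarray(p_small, dtype=float) + lam2 * np.asarray(p_large, dtype=float)
    scores = np.clip(scores, 0.0, None)  # drop negative-score tokens
    total = scores.sum()
    if total == 0.0:  # degenerate case (our own fallback choice)
        return np.asarray(p_small, dtype=float)
    return scores / total

# Toy 3-token vocabulary.
p_s = np.array([0.6, 0.3, 0.1])
p_l = np.array([0.2, 0.5, 0.3])
print(contrast_distribution(p_s, p_l))  # → [1. 0. 0.]
```

With the default (1, −1) configuration, only the token on which the small model is more confident than the large model survives renormalization.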

Analysis
Inverse scaling. As shown in Figure 6 (row 1), we confirm that the perplexity of texts generated with the p_s − p_l configuration presents an inverse scaling trend: perplexity increases as model size increases (columns 1, 5). Other configurations either show only a modest upward trend (p_s) or a normal downward trend (p_l and p_l − p_s).

Figure 6: Perplexity of texts (generated with λ_1 p_s + λ_2 p_l) evaluated with differently-sized final model checkpoints (first row) and perplexity trajectory evaluated over intermediate checkpoints against FLOPs (second row). Each column denotes one configuration with different λ_1 and λ_2. Note that all the texts are generated by combining signals only from the 125M and 30B models, but are evaluated over all model scales.

To verify the universality of the phenomenon in other families of language models, we evaluate the generated texts with final GPT-Neo checkpoints (Black et al., 2021), which were trained on the Pile dataset (Gao et al., 2020). As shown in Figure 7, the perplexity trend aligns with the OPT models. This confirms that the texts generated with our approach are not a result of model or data artifacts, but embody universal properties that exhibit a similar scaling trend in other model families.
Perplexity trajectory of generated sequences.
In the second row of Figure 6, we present the perplexity trajectories of texts generated with different configurations. We observe that texts generated based on p_s − p_l and, to a lesser extent, p_s, largely differ from the other configurations: 125M checkpoints present a downward trend, while the other checkpoints present an upward trend. This might suggest that differently-sized models optimize in different directions for phenomena specific to these texts. However, taking a closer look, we observe that the 1.3B model also shows a downward trend at the beginning, which turns upward afterwards. This indicates that all models improve the perplexity of these texts at first, but with more training FLOPs, larger models shift away from this specific distribution where the 125M model stalls. In Appendix C.7, we further show that the perplexity of the sequences decoded by contrasting the two models (p_s − p_l and p_l − p_s) is less aligned with validation perplexity than that of the other configurations.
Generated examples. Table 1 presents examples generated with different configurations. We find that the generations from p_s − p_l are grammatically correct and carry actual meaning for both greedy search and nucleus sampling, but manifest other issues: 1) they entail highly unlikely semantic usages, such as "Fortunately, it wasn't all that great" (an ending word with a negative sentiment should be more prevalent); 2) the nucleus sampling examples, despite being fluent and consistent, hardly ground to real-world scenarios. This suggests that small models are highly capable linguistically, and that learning at scale primarily focuses on acquiring other types of knowledge. 14

Downstream Tasks
In this section, we examine the trajectory of downstream tasks, evaluated on few-shot in-context learning (ICL).

Task Selection and Evaluation
BIG-Bench (Srivastava et al., 2022) is a large collection of tasks for evaluating language models.

We evaluate intermediate checkpoints on its subset of 74 multiple-choice tasks. 15 BIG-Bench comes with predefined templates with a unified QA format for in-context learning, which mitigates the extra complexity of prompt design. 16 We focus on the 2-shot setting. Following Srivastava et al. (2022), we randomly select two in-context learning examples (excluding the evaluation example itself) for each test instance and pick the candidate for each evaluation example that has the highest probability normalized over its length. We use the average 2-shot accuracy of downstream tasks as a proxy for in-context learning capability.

Table 1: Examples generated with different configurations, all prompted with "Fortunately, the day wasn't all ...":

p_s − p_l (Greedy Search): that great. The sun was setting and the sun was falling. I went to bed and woke my husband, who was asleep in his bed, to find that I was still asleep in the middle of the night with him. He was still awake when we
p_s − p_l (Nucleus Sampling): that good when the computer said doom and gloom about me. Sure enough, because of our stubborn attempt at terrorizing him via cyberbackup (which relied heavily on computer traffic management (VCMD) to ensure my identity), I was able fix my old
p_s (Greedy Search): that bad. I was in the middle of a long day of work and I was in the middle of a long day of work. I was in the middle of a long day of work. I was in the middle of a long day
p_s (Nucleus Sampling): that bad. Not because the weather wasn't bad, but because of how many people didn't move their car around. For those who did, I wanted to say thanks to everyone else who still had a tire change on. That doesn't change
p_s + p_l (Greedy Search): bad. I was able to get a few things done, and I was able to get a few things done. I was able to get a few things done, and I was able to get a few things done. I was able to
p_s + p_l (Nucleus Sampling): cold and we didn't have to set up a heated bed so we wouldn't freeze off in the middle of the night. It was a nice fall day and I had just finished wrapping up the color scheme on the wall. I still haven
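The candidate-selection rule can be illustrated as below; we read "probability normalized over its length" as the mean token log-probability, which matches the description but may differ from the exact BIG-Bench implementation:

```python
import math

def pick_option(option_logprobs):
    """Pick the candidate with the highest probability normalized over its
    length, i.e., the highest mean token log-probability."""
    scores = [sum(lps) / len(lps) for lps in option_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Both options have total probability 0.2, but option 1 has the higher
# per-token probability, so it wins under length normalization.
options = [
    [math.log(0.2)],                 # 1 token,  avg logp = log 0.2
    [math.log(0.5), math.log(0.4)],  # 2 tokens, avg logp = log sqrt(0.2)
]
print(pick_option(options))  # → 1
```

Without the normalization, both options would tie; length normalization avoids penalizing longer candidates.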

Trajectory of ICL Performance
ICL vs. valid PPL. From Figure 8 (leftmost), it is evident that the downstream task performance strongly correlates with validation perplexity across all model sizes. The curves of different model sizes significantly overlap, indicating that when a small model and a large model are trained to the same perplexity level, they achieve comparable downstream task performance.
ICL vs. other metrics. From the other panels of Figure 8, it is evident that plotting task accuracy against various metrics yields distinct patterns. Notably, when subjected to an equal amount of training FLOPs, the performance of smaller models consistently surpasses that of larger models, with the exception of the 125M model. This observation implies that larger models possess untapped potential for improvement, especially when provided with more training FLOPs or data (Hoffmann et al., 2022; Touvron et al., 2023). Conversely, the remaining two plots indicate that larger models consistently outperform smaller ones when trained with the same number of training tokens and training steps.

Linearity vs. Breakthroughness Tasks
We select 12 tasks that present a linearity scaling pattern and 6 tasks that present a breakthroughness scaling pattern, 17 and plot the perplexity of the correct and incorrect options for each group of tasks against validation perplexity in Figure 9.
The performance of breakthroughness tasks increases tremendously as the validation perplexity drops below 8. The perplexity gap between the correct and incorrect options also starts to expand at this point for the 30B and 175B models. In contrast, the accuracy of linearity tasks increases gradually. The perplexity of both correct and incorrect options first decreases as validation perplexity decreases, and it is only at the end of the curve that the perplexity of correct and incorrect options starts to diverge. This suggests that improvements in downstream accuracy are generally driven not by the model learning to assign a lower probability to incorrect candidates, but rather by the perplexity divergence between correct and incorrect options.

Breakthroughness Tasks Learn Smoothly on Trajectory
In Appendix D.4, we provide a detailed analysis of task accuracy in relation to perplexity and FLOPs for individual linearity and breakthroughness tasks. The corresponding plots can be found in Figure 17 and Figure 18. As expected, these plots exhibit a significantly larger variance, showcasing substantial fluctuations in task performance during the training process. However, we still observe a notable alignment between task accuracy and validation perplexity across different model scales. Notably, the breakthroughness tasks, which demonstrate sudden performance improvements at the final checkpoints, display a smooth and continuous growth trend along the training trajectory. This observation reinforces the findings of a recent study by Schaeffer et al. (2023), which discovered that modifying downstream task metrics results in gradual changes in performance rather than abrupt and unexpected shifts as model scale increases. These results suggest that when examining task performance at a finer level, either through continuous metrics or continuous model checkpoints, task performance largely exhibits a smooth growth pattern in tandem with validation perplexity. Nevertheless, as suggested by Ganguli et al. (2022), accurately predicting the learning curve of a specific task still remains challenging.
A prior study (2022) claims the opposite relationship between in-context learning performance and perplexity when training language models with different corpora, but it only tests four downstream tasks on a few model checkpoints. Our work extensively evaluates multiple domains and tasks on both language modeling and downstream tasks across checkpoints of different scales, which entails less variance.
Effective scaling. Several prior studies have focused on effectively scaling models by examining limited compute settings (Geiping and Goldstein, 2022), exploring different objectives (Tay et al., 2022b; Artetxe et al., 2022b), and investigating different architecture and training setups (Scao et al., 2022b). This work specifically examines model scales under a unified setting, but the proposed techniques can be applied to other settings as well.

Conclusion
To summarize, our study demonstrates that validation perplexity is a reliable indicator of the behavior of OPT models, regardless of their sizes. Larger models, with increased computational power and capacity, exhibit behavior similar to that of smaller models while also unlocking new phenomena and capabilities as validation perplexity decreases further. However, there are certain exceptional cases where models behave differently, sometimes even in opposite directions, such as in the perplexity of texts generated by contrasting two models. This suggests that the underlying model distributions are not entirely identical at the same perplexity level.

Limitations
We discuss the limitations of the work as follows:
• One major limitation of our work is that we analyze language models pre-trained with the same data, similar training procedures, and the same autoregressive language modeling objective. For language models pre-trained with different data, objectives, or training setups, the relationship between validation perplexity and downstream task performance could be more obscure.
• For downstream task evaluation, we only evaluate on multiple-choice tasks, where the evaluation protocol is the most similar to the pre-training objective. Evaluating on generation-based tasks is messier and harder to scale up, and we leave it for future work. Another risk is that, as we always take aggregated measurements over tasks, this might conceal important patterns of individual tasks.
• We do not provide a concrete explanation for the double-descent behavior that consistently occurs during pre-training, nor do we know if it is an artifact of the data, the objective, or the optimization process. We consider it an interesting phenomenon and will look more closely into it in future work.

A Checkpoint Details
We present the checkpoint information in Table 2. OPT models of different sizes are trained with different batch sizes and thus end up with different numbers of steps given the same amount of training tokens. We select early-stage checkpoints every 4K steps for evaluation, and enlarge the interval to 10K or 20K steps for late-stage checkpoints. A few checkpoints are missing or corrupted from the training process, e.g., 125M at 180K steps, and we have to eliminate them from our evaluation.
All OPT models are trained with 300B tokens, of which 180B tokens are unique. This training procedure means that OPT models are trained with repeated data, even though training with non-repeating data consistently leads to better performance in language modeling and downstream tasks (Lee et al., 2022; Hernandez et al., 2022).

B.1 Data Used in the Main Paper
We use the Gutenberg PG-19 (Rae et al., 2020) subset as the main dataset for analysis in the main paper. This validation subset contains 50 lines of text, and we take the first 2048 tokens of each line for analysis, resulting in 102,350 context-token pairs. We observe similar patterns when evaluating on other validation subsets such as Wikipedia and opensubtitles, and we omit these results for brevity.

B.2 Trajectory of Other Tokens
We set our criteria to be relatively strict to make sure that the perplexity trajectory of the selected tokens does present the trend (stagnated/upward/downward) we expect. We present the trajectory of the tokens that do not fall into any of the categories in Figure 10. We find that the trend of these tokens is not consistent across models. After 10% of training, the curves of the 125M, 1.3B, and 6.7B models present a slight double-descent trend, while the curves of the remaining models present a downward/stagnated trend. After 40% of training, the curves of the 125M model present a slight double-descent trend towards the end, and the curves of the other models present a downward/stagnated trend. This suggests that the remaining tokens might have a larger variance in their perplexity trajectories.

B.3 Properties of Stagnated and Upward-Trend Tokens
We show an example paragraph in Table 3, where the stagnated tokens are in blue, upward-trend tokens are in red, and downward-trend tokens are in green. It is easy to see that stagnated tokens are mostly connecting words, determiners, punctuation, and continuations of words. However, we find it hard to characterize the tokens that present an upward trend in perplexity simply based on token types. We made attempts to further decipher what language properties this subset might entail based on part-of-speech tags and positions in sequences, but did not observe any obvious patterns compared to all the tokens in the validation set. One thing we are sure of is that the upward trend in perplexity, as well as the double-descent phenomenon on a certain subset of tokens, systematically appears across all model sizes. Therefore, this subset of context-token pairs must embody certain intrinsic language properties, which might be beyond our comprehension so far. It would be interesting to do an in-depth analysis to understand why it happens during pre-training and how it connects to natural language properties.

B.4 More Explorations on Upward Trends
In this section, we explore the subsets of tokens that present an upward trend when selected by models of sizes other than in the main paper (6.7B, 13B, 30B). We present the perplexity trajectories of these tokens in Figure 11. For the subset of tokens selected after 10% of training of the 6.7B model, the larger models' perplexity also increases, but only the largest 175B model presents a double-descent behavior where the perplexity declines further. When the tokens are selected after 40% of training of the 6.7B model, the trends remain similar but the change is much milder. Overall, except for the model that is used to select the tokens, the curves of the other models present a similar trend, and we will show that these curves overlap with each other almost completely when plotted against validation perplexity in the next subsection. The consistent occurrence of double-descent behavior along the trajectory shows that it is a phenomenon happening universally across the entire autoregressive pre-training process.

B.5 Results against Validation Perplexity
In the main paper, we mostly plot measurements against FLOPs. In this section, we plot the perplexity trajectories of tokens that present different trends against validation perplexity in Figure 12. These figures present the same series of results as Figure 3 and Figure 4, except that the x-axis is validation perplexity. As mentioned in Section 2, we use the aggregated perplexity of a number of subsets as the validation perplexity. From Figure 12, we see that, given a similar level of validation perplexity, the trajectories of models across sizes overlap well with each other for the different subsets of tokens, suggesting that the predictions for these tokens are similar across model scales at a fixed level of validation perplexity. The only exception is the upward-trend tokens selected after 10% of training of the 1.3B model, where evaluating with 1.3B presents a clear upward trend as the validation perplexity increases, while the models larger than 1.3B present an overlapping double-descent-like trend. This indicates that the underlying distributions of models at the same level of perplexity are largely similar but could differ in edge cases.
These results lay the foundation for downstream task evaluations, which rely heavily on the pre-training objective for evaluation.

Table 3: An example paragraph from the validation set (a glossary passage) with each token colored by its trend category: stagnated tokens in blue, upward-trend tokens in red, and downward-trend tokens in green.

C.1 Details of Corrupted Datasets
We corrupt texts from the OpenSubtitles subset of the validation set by replacing p% of the tokens (subwords) with randomly sampled tokens. We cap the maximum length of a sequence at 100 tokens, though changing the maximum length does not affect the conclusions. Although the perplexity on these corrupted sequences is extremely high, especially when the replacement rate is high, it is still much lower than that of a truly random model (whose perplexity is |V|, where V is the vocabulary), even on the fully corrupted dataset. This reflects that larger language models are better than smaller ones at exploiting random patterns to produce in-distribution content. We also tried other forms of corruption, such as deleting, inserting, and repeating tokens/spans, and all of them result in similar scaling trends.
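For concreteness, here is a minimal sketch of this corruption procedure, assuming replacement tokens are drawn uniformly from the vocabulary; the function name and signature are illustrative and not taken from the paper's codebase.

```python
import random

def corrupt_tokens(token_ids, vocab_size, p, max_len=100, seed=0):
    """Replace a fraction p of a token sequence with uniformly random tokens.

    Sequences are truncated to max_len tokens, mirroring the length cap
    described above. A model that is itself uniformly random would assign
    each token probability 1/|V|, giving perplexity |V|; the observed
    perplexity on fully corrupted text stays well below that bound.
    """
    rng = random.Random(seed)
    tokens = list(token_ids[:max_len])
    n_replace = int(round(p * len(tokens)))
    # Choose distinct positions to corrupt, then resample each token.
    for i in rng.sample(range(len(tokens)), n_replace):
        tokens[i] = rng.randrange(vocab_size)
    return tokens

# Corrupt half the tokens of a 200-token sequence (truncated to 100).
corrupted = corrupt_tokens(list(range(200)), vocab_size=50000, p=0.5)
```

Deletion, insertion, and span-repetition variants mentioned above would only change the edit applied at each sampled position.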

C.2 Comparison to Li et al. (2022)
Our decoding approach is similar to the contrastive decoding (CD) method proposed in Li et al. (2022), though it was initially developed for a completely different purpose. The difference between the two methods is the space in which the subtraction operates. The contrastive score in CD is defined by dividing the expert probability by the amateur probability, which is equivalent to subtraction in log-probability space. Our approach performs the subtraction directly in probability space, ruling out unlikely options on which the small model is much more confident than the large model. Due to this design choice, our approach needs neither the adaptive plausibility restriction nor any additional hyperparameter: subtraction in probability space readily eliminates the false-positive cases. We initially proposed the approach to decode sequences that small models favor more than large models, in order to understand the distributional shift across model scales, while the contrastive decoding of Li et al. (2022) is a general open-ended generation approach. Nonetheless, our approach could be an effective and lightweight alternative for open-ended generation that requires no hyperparameter tuning. In Appendix C.4, we show that our approach outperforms nucleus sampling on MAUVE scores.

[Figure 11: Perplexity of tokens that present an upward trend after 10% or 40% of training of the 6.7B, 13B and 30B models (e.g., after 40% of training of the 30B model). For each figure, all the models are evaluated on the same subset of tokens.]
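To make the contrast with log-space CD concrete, the following is a minimal sketch of one greedy step of probability-space subtraction, given next-token distributions from a large and a small model. The function name and the toy distributions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def subtract_decode_step(p_large, p_small):
    """One greedy step of probability-space contrastive decoding.

    Picks the token maximizing p_large - p_small. Unlike subtraction in
    log space (equivalent to the probability ratio used by CD), tokens on
    which the small model is far more confident than the large one receive
    negative scores and are ruled out without any plausibility threshold
    or extra hyperparameter.
    """
    scores = np.asarray(p_large, dtype=float) - np.asarray(p_small, dtype=float)
    return int(np.argmax(scores))

# Toy distributions over a 3-token vocabulary: token 0 is a generic token
# the small model over-favors; token 1 is where the large model adds mass.
p_l = [0.5, 0.4, 0.1]
p_s = [0.6, 0.1, 0.3]
next_token = subtract_decode_step(p_l, p_s)  # selects token 1
```

The p_s − p_l direction used elsewhere in the paper to surface sequences that small models prefer is the same operation with the arguments swapped.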

C.3 Generation Quality
To better understand the overall quality of the generated sequences, we evaluate the sequences decoded with each configuration in Figure 6 using MAUVE scores (Pillutla et al., 2021), presented in Figure 13. Our generation protocol differs slightly from standard open-ended generation practice in that we use only 5 tokens as prompts, whereas usually at least 128 tokens are used (Krishna et al., 2022; Su and Collier, 2022; Li et al., 2022). Using fewer prompt tokens leads to higher generation diversity, and the generated distribution can differ substantially from the ground-truth sentences. Therefore, the MAUVE scores of our generated sequences are much lower than those reported in the open-ended generation literature.
Comparing the two decoding protocols, subtraction between the two distributions (p_s − p_l and p_l − p_s) leads to better generation quality than summing them (p_s + p_l) under greedy decoding, but vice versa under nucleus sampling. To verify the effectiveness of the approach, we compare it to nucleus sampling under standard open-ended generation protocols in Appendix C.4.

C.4 Open-ended Generation Evaluation
We follow the generation protocol in Krishna et al. (2022) for open-ended generation, generating sequences with a maximum length of 128 given contexts of 256 tokens. We decode sequences based on either p_l − p_s or p_l with greedy decoding or nucleus sampling (p = 0.9) and evaluate the quality of the generations with MAUVE scores. We present the results in Table 4. Subtracting the small model's probabilities from the large model's consistently outperforms nucleus sampling with a single model, indicating that our approach has the potential to serve as an effective general decoding method for open-ended generation.

C.5 Generating Longer Sequences
We extend the study to generate longer sequences of up to 100 and 500 tokens, and we present the perplexity trajectories in Figure 14 and Figure 15, respectively. We find that the inverse scaling trend across model sizes and the opposite perplexity trends between the 125M and 30B models also hold for longer sequences. MAUVE scores on generated sequences of different lengths are largely consistent, though the longer the decoded sequences are, the worse the overall quality.

C.6 Examples of Generated Sequences
We present more examples of generated sequences in Table 5 and Table 6. Similar to Table 1, we find that nucleus sampling with p_l, nucleus sampling with p_l − p_s, and greedy search with p_l − p_s consistently generate high-quality sequences. Greedy decoding with p_s − p_l generates mediocre sequences that are largely grammatical and fluent, but less coherent and sometimes contain hallucinations.

C.7 Validation Perplexity vs. Perplexity of Generated Texts
We plot the perplexity trajectory of generated texts against validation perplexity in Figure 16. The trajectories largely align across model sizes for p_s, p_s + p_l, and p_l, but diverge for p_l − p_s and p_s − p_l. This indicates that the underlying distributions of different-sized models at the same perplexity are similar but not exactly identical.

D.1 Task Selection and Evaluation
Out of computational considerations, we only evaluate multiple-choice tasks that have fewer than 1000 evaluation examples. The list of selected tasks is shown in Table 7. We report 2-shot in-context learning performance on the default set of each BIG-Bench dataset.

D.2 Prompts
We use the fixed prompt formats from the BIG-Bench datasets. Optimizing the prompts might yield extra margins in performance; studying the relationship between prompt formats and downstream task performance along the trajectory is interesting, but we consider it out of the scope of this work. We present examples from four datasets in Table 8.

[Figure 16: Validation perplexity vs. perplexity of generated texts. Models of different scales do not have the same perplexity on the generated texts when decoding with p_s − p_l or p_l − p_s at the same validation perplexity, but they largely align when decoding with other configurations.]

Some tasks show breakthroughness patterns. Previous works mainly study the scaling patterns of downstream tasks with final model checkpoints, and we extend this to the training trajectories of models across scales. We largely follow Srivastava et al. (2022) to identify tasks with linearity and breakthroughness patterns: the former describes a trend where task performance scales reliably with model size, while for the latter, performance remains low until a critical model size is reached.

D.3 Linearity and Breakthroughness Tasks
We select 12 tasks that show a linearity pattern and 6 tasks that show a breakthroughness pattern based on the metrics proposed by Srivastava et al. (2022). For each model size x_i and the corresponding performance y_i, the metrics are defined in terms of I(y) = sign(argmax_i y_i − argmin_i y_i) · (max_i y_i − min_i y_i), a measure that captures the overall improvement in performance when scaling up. We find that these two metrics alone are not sufficient for identifying the linearity and breakthroughness scaling trends, so we also manually check the scaling patterns to verify. The linearity and breakthroughness tasks are listed in Table 9.
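As a sanity check of this measure, here is a small sketch computing I(y), under the assumption that the second argmax in the garbled original is a typo for argmin (otherwise the sign term would always be zero); the function name is ours.

```python
import numpy as np

def overall_improvement(y):
    """I(y) = sign(argmax_i y_i - argmin_i y_i) * (max_i y_i - min_i y_i).

    y lists task performance ordered by increasing model size. I(y) is
    positive when the best performance occurs at a larger size than the
    worst, with magnitude equal to the overall performance range.
    """
    y = np.asarray(y, dtype=float)
    return float(np.sign(np.argmax(y) - np.argmin(y)) * (y.max() - y.min()))

# Monotone improvement with scale gives a positive score; an inverse
# scaling trend flips the sign, and a flat trend scores zero.
i_up = overall_improvement([0.1, 0.2, 0.5, 0.9])    # positive
i_down = overall_improvement([0.9, 0.5, 0.2, 0.1])  # negative
```

Note that I(y) alone cannot distinguish a steady linear climb from a late jump, which is consistent with the manual verification step described above.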

D.4 Trajectory of Each Task
We present the scaling curves (on the final model checkpoints) and the training trajectories of each linearity and breakthroughness task in Figure 17 and Figure 18. The evaluation of each task presents a large variance across training steps. Though some tasks present a breakthroughness pattern on the scaling curves, their trajectory curves show that language models pick up these tasks gradually.
[Tables 5 and 6 appear here: generated examples with greedy search and nucleus sampling under the p_l, p_l − p_s, and p_s − p_l configurations. The prompt for Table 6 is "Now in private practice together,".]