Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

There has been a lot of interest in the scaling properties of Transformer models. However, little has been done to investigate the scaling properties of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, Dynamic Convolutions, Performers, and the recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.


Introduction
There has been a lot of recent interest in the scaling properties of Transformer models (Kaplan et al., 2020; Hernandez et al., 2021; Bahri et al., 2021; Henighan et al., 2020; Tay et al., 2021b; Abnar et al., 2021). However, not much is understood about the scaling properties of the different inductive biases imposed by model architectures. Improvements at a specific scale (compute, size, etc.) are often assumed to transfer to different scales and compute regions (So et al., 2019; Choromanski et al., 2020; Lan et al., 2019; Dehghani et al., 2018), and new research is often presented in a point-wise fashion with respect to scale. In short, it is not uncommon for new methods to be presented with data points at very specific or limited compute regions (e.g., base size). We believe that understanding the interaction between architecture and scaling laws is crucial, as designing models that perform well at diverse scales will likely have significant impact.
This paper is an attempt to understand the effect of inductive bias (architecture) on the scaling laws of language models. To this end, we pre-train and finetune over ten diverse model architectures across multiple compute regions and scales (e.g., from 15M to 40 billion parameters). In total, we pre-train and finetune over 100 models of different architectures and sizes, and present insights and challenges in scaling these ten diverse architectures.
We consider a broad spectrum of models in our extensive experiments. Concretely, we consider several well-established Transformer variants (Vaswani et al., 2017) such as Evolved Transformer (So et al., 2019), Universal Transformers (Dehghani et al., 2018) and Switch Transformers (Fedus et al., 2021). We also consider lightweight models such as ALBERT (Lan et al., 2019) and/or efficient Transformers (Tay et al., 2020) such as Performer (Choromanski et al., 2020) and Funnel Transformers (Dai et al., 2020). In our comparison, we are also interested in finding out whether general improvements to the Transformer architecture such as Mixture-of-Softmaxes (Yang et al., 2017) and/or Gated Linear Units (Dauphin et al., 2017; Shazeer, 2020) influence the scaling behaviour of models. Finally, we also evaluate models outside the family of Transformers, including Lightweight Convolutions (Wu et al., 2019), Dynamic Convolutions (Wu et al., 2019) and the recently proposed MLP-Mixers (Tolstikhin et al., 2021). Figure 1 illustrates an overview of the experiments we run.
We also note that scaling these models is not as straightforward as it seems, i.e., there are intricate details of scale that are intertwined with architectural choices, which we study in detail in this paper. For example, a distinct feature of Universal Transformers (and ALBERT) is parameter sharing. Hence, compared with standard Transformers, this architectural choice significantly warps the scaling behaviour, not only with respect to performance but also among compute metrics such as FLOPs, speed and number of parameters (Dehghani et al., 2021a). Conversely, models such as Switch Transformers are on the other end of the spectrum, with an uncommon relationship between FLOPs and number of parameters, i.e., they have a high parameter-to-FLOPs ratio. This makes navigating this landscape challenging.

Our Contributions and Insights
The key contributions of this paper are as follows:
• For the first time, we derive scaling laws for different inductive biases and model architectures. We find that the scaling coefficient differs greatly from model to model. We believe this is an important consideration in model development. It turns out that among all ten architectures we consider, the vanilla Transformer has the best scaling behaviour, even if its absolute performance at each compute region is not the greatest.
• We observe that a model that operates well in one compute-scale region is not necessarily the best in another compute region. Moreover, we find that certain models have difficulty scaling despite performing decently (comparably) at lower compute regions. This has implications, since it is difficult to get the full picture of a model's scalability from pointwise comparisons at a single compute region.
• We find that when it comes to scaling different model architectures, upstream pre-training perplexity might not correlate well with downstream transfer. Hence, the underlying architecture and inductive bias are also crucial for downstream transfer.
• We highlight the difficulties of scaling certain architectures and show that some models do not scale (or scale with a negative trend). We also find concerning trends where linear-time attention models such as Performer struggle with scaling up.
Related Work

Kaplan et al. (2020) studied empirical scaling laws of decoder-only Transformer language models. They focused on the standard left-to-right language modeling objective with cross-entropy loss as the performance metric. One of the main findings is that the loss scales as a power law with three major characteristics of model training: model size, dataset size and training compute.
Another somewhat surprising finding is that model shape, such as the width or depth of the Transformer network, has minimal effect on the cross-entropy loss over a wide range of scales. Subsequent works (Henighan et al., 2020; Hernandez et al., 2021) made similar conclusions for autoregressive generative modeling and for transfer learning, respectively. This finding is also generally supported by (Tay et al., 2021b), but discrepancies were found in the gap between pretraining and finetuning, highlighting the fact that observing the downstream performance of large language models is indeed important. In (Tay et al., 2021b), the effect of depth was unusually pronounced for downstream performance. Raffel et al. (2019) studied the effect of pretraining objectives, model structures (e.g., encoder-decoder, decoder-only), pre-training dataset size and training strategy on transfer learning. They showed that downstream performance monotonically increases with model scale (from 60M to 11B parameters). While they studied several model structures, the Transformer implementation is mostly the same as the original Transformer of Vaswani et al. (2017). Conneau et al. (2020); Goyal et al. (2021) scaled up multilingual encoder-only architectures to 11B parameters while maintaining the original Transformer implementation. They found that scaling the model improves its cross-lingual ability. Fedus et al. (2021) scaled a sparse model based on Mixture of Experts (MoE) up to a trillion parameters.
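As a concrete illustration of the power-law form reported by Kaplan et al. (2020), the following sketch fits L(N) = c * N^(-alpha) to loss values by linear regression in log-log space. The data and constants below are invented for illustration; they are not results from any of the cited papers or from this one.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit L(N) = c * N**(-alpha) by linear regression in log-log space.

    Illustrative only; Kaplan et al. fit richer functional forms over
    many orders of magnitude. Returns (alpha, c).
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(log_n, log_l, 1)
    return -slope, float(np.exp(intercept))

# Synthetic losses generated from a known power law (alpha = 0.076 is
# roughly the model-size exponent reported by Kaplan et al.).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 8.0 * sizes ** -0.076
alpha, c = fit_power_law(sizes, losses)
print(round(alpha, 3))  # recovers 0.076
```

Because the synthetic points lie exactly on a line in log-log space, the regression recovers the generating exponent; on real measurements the fit would only approximate it.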
While previous studies have repeatedly shown the benefits of scale for language understanding tasks, for both dense and sparse Transformers and for cross-lingual abilities, all of them used the same Transformer implementation within each study. With a plethora of improved Transformer architectures proposed in the literature, it is timely to investigate which of these improved architectures has the best scaling properties. The main goal of this paper is to systematically study how the inductive biases imposed by these Transformer variants affect scaling behavior in a shared software and hardware setting. This is in a similar spirit to (Narang et al., 2021), which studies the impact of architectures on performance. Our analysis extends that of (Narang et al., 2021) along the model scale axis.
We note that the number of tokens seen during pretraining has increasingly been incorporated into the study of scaling laws (Hoffmann et al., 2022; Muennighoff et al., 2023). Hoffmann et al. (2022) train decoder-only Transformer language models with causal language modeling and evaluate on zero- and few-shot tasks. In this work we consider architecture modifications that do not necessarily support causal masking and autoregressive decoding. Because of this, we consider encoder-decoder configurations trained with span corruption, and evaluate on downstream finetuned tasks. This creates a more level playing field for architectures that do not support in-context learning. As such, we follow (Raffel et al., 2019) and fix the number of pretraining tokens (i.e., sequence length, training steps, batch size) seen by each model. Given the large space of model architectures and scales we aim to study, this also fixes the data size dimension, making our empirical study more tractable. Since we finetune models until convergence, we anticipate the effect of the amount of pretraining tokens to be less pronounced than studied in (Hoffmann et al., 2022).

Methods
This section outlines our experimental setup.

Models
This section describes the models we evaluate in our experiments. Our models are largely implemented in a sequence-to-sequence framework (Sutskever et al., 2014) following the conventions of T5 (Raffel et al., 2019). Encoder-decoder models are a natural choice for this experimentation because they can universally express both encoding and decoding tasks.
Transformer Variants We consider several standard Transformer variants:

• Transformers (Vaswani et al., 2017) - The standard encoder-decoder Transformer.
• Evolved Transformer (So et al., 2019) - A Transformer variant discovered via neural architecture search.
• Universal Transformer (UT) (Dehghani et al., 2018) - A Transformer that repeatedly applies a shared, parameter-tied block.
• Switch Transformer (Fedus et al., 2021) - A sparse Mixture-of-Experts Transformer that routes each token to a single expert.

Efficient Transformer Variants This class of models is mainly concerned with reducing the computational cost, memory usage, or parameter count of models.
• Performer (Choromanski et al., 2020) - A linear-time attention model using generalizable kernel attention. For simplicity, we adopt the relu kernel variant for our experiments. We scale Performer in a similar fashion (i.e., uniform scaling) to vanilla Transformers.
• Funnel Transformer (FT) (Dai et al., 2020) - A Transformer architecture that downsamples the input sequence across the layer stack. Our implementation uses FT only in the encoder and reverts to the vanilla Transformer in the decoder, following Narang et al. (2021).
• ALBERT (Lan et al., 2019) - A lightweight Transformer architecture that shares parameters across all layers and factorizes the embedding and output softmax layers. For our seq2seq ALBERT, we also share the weights of the encoder and decoder.
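To make the linear-time attention mentioned in the Performer bullet concrete, the toy sketch below replaces softmax(QK^T)V with phi(Q)(phi(K)^T V), so cost grows linearly in sequence length. This is an illustrative simplification (single head, no masking, an invented eps for numerical stability), not the actual Performer implementation, which also offers random-feature approximations of softmax attention.

```python
import numpy as np

def relu_kernel_attention(q, k, v, eps=1e-6):
    """Linear-time kernel attention, sketching the relu-kernel idea.

    Instead of softmax(Q K^T) V (quadratic in sequence length n), apply a
    feature map phi (here ReLU) to queries and keys and reassociate:
    phi(Q) @ (phi(K)^T @ V), which costs O(n * d^2).
    """
    qp = np.maximum(q, 0.0)          # phi(Q), shape (n, d)
    kp = np.maximum(k, 0.0)          # phi(K), shape (n, d)
    kv = kp.T @ v                    # (d, d) summary of keys and values
    z = qp @ kp.sum(axis=0) + eps    # per-query normalizer, shape (n,)
    return (qp @ kv) / z[:, None]    # normalized attention output

rng = np.random.default_rng(0)
n, d = 8, 4
out = relu_kernel_attention(rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
print(out.shape)  # (8, 4)
```

Reassociating the matrix product is the entire trick: the (n, n) attention matrix is never materialized, which is what makes this family of models "linear".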
General Improvements We consider general improvements that are not necessarily tied to Transformers. We select candidates that have been shown to do well in Narang et al. (2021).
• Mixture of Softmaxes (MoS) (Yang et al., 2017) - A Transformer architecture adopting the MoS method at the softmax layer.
• Gated Linear Units (GLU) (Dauphin et al., 2017; Shazeer, 2020) - A Transformer architecture using GLU variants in the position-wise feed-forward layers (GLU-Transformer).

Non-Transformer Architectures
We are interested in the scaling behaviour of non-Transformer architectures such as convolutional and/or mixer architectures.
• Lightweight Convolutions (LConv) (Wu et al., 2019) - Depthwise convolutions with softmax-normalized kernels and weight sharing across channels.
• Dynamic Convolutions (Wu et al., 2019) - An extension of Lightweight Convolutions that predicts time-dependent kernels.
• MLP-Mixers (Tolstikhin et al., 2021) - Mixers are recently proposed architectures that learn a lightweight mixing of tokens. Since Mixers have not been used in autoregressive decoding, we only use token mixers in the input encoder.
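The token-mixing idea behind MLP-Mixer can be sketched in a few lines: tokens are mixed by an MLP applied across the sequence dimension rather than by attention. The block below is a hypothetical minimal version (ReLU instead of GELU, no layer norm, no channel-mixing MLP), not the Mixer implementation used in this paper.

```python
import numpy as np

def token_mixing_block(x, w1, w2):
    """A minimal token-mixing MLP, sketching the Mixer idea.

    Unlike self-attention, tokens are mixed by an MLP applied across the
    sequence dimension: transpose to (channels, tokens), apply a 2-layer
    MLP, transpose back. The weights here are hypothetical stand-ins.
    """
    y = x.T                       # (d, n): mix along the token axis
    y = np.maximum(y @ w1, 0.0)   # (d, hidden); the paper uses GELU
    y = y @ w2                    # (d, n)
    return x + y.T                # residual connection, back to (n, d)

rng = np.random.default_rng(0)
n, d, hidden = 6, 4, 12
x = rng.normal(size=(n, d))
out = token_mixing_block(x,
                         rng.normal(size=(n, hidden)) * 0.1,
                         rng.normal(size=(hidden, n)) * 0.1)
print(out.shape)  # (6, 4)
```

Note that the mixing weights have shapes tied to the sequence length n, which is one reason applying Mixers to autoregressive decoding is not straightforward, as the bullet above points out.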

Experiment Setup
Our setup, along with all models, is implemented in Mesh TensorFlow (Shazeer et al., 2018), a library with a similar interface to TensorFlow that enables distributed model parallelism across multiple workers. For fair comparison, all models are pretrained for 2^19 steps on the English C4 corpus, optimized with an inverse square root learning rate schedule using Adafactor (Shazeer and Stern, 2018). All models use the same SentencePiece tokenizer (Kudo and Richardson, 2018) containing 32K subwords. This closely follows the setup in the T5 paper (Raffel et al., 2019). Finetuning is performed for 100K steps on a mixture of GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). We evaluate both upstream (pre-training) validation perplexity and downstream transfer on NLU tasks (GLUE + SuperGLUE + SQuAD) after fine-tuning. We pretrain and finetune our models on 16 TPU-v3 chips with data parallelism. All large models use a model parallelism of 2 and XL models use a model parallelism of 8.
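For reference, the inverse square root schedule mentioned above is commonly implemented as below; the warmup constant is the usual T5 default and is an assumption here, not a setting confirmed by this paper.

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    """Inverse square root schedule in the style of T5 (Raffel et al., 2019):
    lr = 1 / sqrt(max(step, warmup_steps)).

    The rate is constant during warmup, then decays as 1/sqrt(step).
    The warmup length follows the common T5 default; this paper's exact
    setting may differ.
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1))       # 0.01 during warmup
print(inverse_sqrt_lr(40_000))  # 0.005 once decay kicks in
```

The flat warmup plateau avoids the very large rates that a pure 1/sqrt(step) curve would produce at the first few steps.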

Model Sizes
We consider several different model sizes for each architecture. For models that are straightforward to scale, we simply follow the standard convention in Raffel et al. (2019), moving from small to base, to large and XL. We include a tiny version of each model to observe how different models behave at lower compute regions. For models that were not straightforward to scale (e.g., Universal Transformers, ALBERT), we tried to scale them in a similar fashion but faced obvious limitations, such as getting ALBERT to have the same number of parameters as T5 XL without incurring a huge cost in terms of FLOPs.
For convolutional models, we consider d_model to be the hidden size (i.e., channel depth) of the one-dimensional convolution layers. Values such as d_kv and N_H then become redundant. Details on the scaling of each architecture can be found in the supplementary material.
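For orientation, the sketch below lists the public T5 v1 configurations for the standard sizes together with a rough parameter-count formula for an encoder-decoder stack. The tiny variant and the per-architecture tweaks used in this paper are in its supplementary material and are not reproduced here; the count formula is a back-of-the-envelope approximation, not the paper's accounting.

```python
# Uniform scaling following the public T5 v1 sizes (Raffel et al., 2019).
T5_SIZES = {
    #        N_L  d_model  d_ff   N_H  d_kv
    "small": (6,   512,    2048,   8,   64),
    "base":  (12,  768,    3072,  12,   64),
    "large": (24,  1024,   4096,  16,   64),
    "xl":    (24,  1024,  16384,  32,  128),
}

def approx_params(n_layers, d_model, d_ff, n_heads, d_kv, vocab=32_000):
    """Rough encoder-decoder parameter count: embeddings plus, per layer,
    the attention projections and the feed-forward block (decoder layers
    add one extra cross-attention). Ignores norms and biases."""
    attn = 4 * d_model * n_heads * d_kv   # Q, K, V, O projections
    ffn = 2 * d_model * d_ff              # two MLP matrices
    enc = n_layers * (attn + ffn)
    dec = n_layers * (2 * attn + ffn)     # extra cross-attention
    return vocab * d_model + enc + dec

for name, cfg in T5_SIZES.items():
    print(name, f"{approx_params(*cfg) / 1e6:.0f}M")
```

For the small and base configurations this lands near the familiar ~60M and ~220M totals, which is a useful sanity check when scaling a new architecture uniformly.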

Main Results
We report the main results of this paper in Table 1. We report the number of trainable parameters, FLOPs (of a single forward pass) and speed (steps per second). We also report validation perplexity (on upstream pre-training) and results on 17 downstream tasks. The downstream results are reported as aggregates of GLUE, SuperGLUE and SQuAD. While we use the same Mesh TensorFlow-based codebase as Raffel et al. (2019), and hence expect our experimental results to match theirs, we verify that our T5 base does achieve results similar to those reported in Raffel et al. (2019).

Do all models scale the same way?
We compare on both upstream perplexity and downstream finetuning performance here.
Upstream Perplexity Figure 2 reports the scaling behaviour of all models as we increase the number of FLOPs. We observe that the scaling behaviour of each model is quite unique and distinct, i.e., most of them differ from standard Transformers. Perhaps the biggest finding here is that most models (e.g., LConv, Evolved Transformer) seem on par with or better than standard Transformers at lower compute but fail to scale with a higher compute budget. Another interesting trend is that "linear" Transformers such as Performer fail to scale, as shown in Figure 2i: their pre-training perplexity only decreases by 2.7% going from base to large scale, compared to 8.4% for the vanilla Transformer.
Downstream Transfer Figure 3 reports the scaling curves of all models on downstream transfer. The overall finding that most models have distinct scaling curves compared to Transformers is also evident in downstream tasks. It is also noteworthy that most models have different upstream and downstream scaling curves. Some models, such as Funnel Transformer and LConv, seem to hold up well upstream but suffer substantially downstream. As for Performer, the performance disparity seems to be even greater downstream than upstream. Notably, the SuperGLUE downstream tasks generally require pseudo cross-attention on the encoder, which models such as convolutions are not equipped to handle (Tay et al., 2021a). Hence, we find that certain models may have difficulty learning the downstream tasks despite good upstream performance.
Are the best models at each scale different?
Figure 1 shows the Pareto frontier when plotting compute against upstream and downstream performance. Since the colors of the plot represent different models, we can observe that the best model for each scale and compute region may differ. Figure 3 shows the same effect: for example, the Evolved Transformer seems to do well against the standard Transformer in the tiny-to-small (downstream) region, but this quickly changes when scaling the model up. We also observe this with the MoS-Transformer, which clearly outperforms the vanilla Transformer in some compute regions but not in others.

Scaling Law for Each Model
Table 2 presents the slope α of the fitted linear line for each model across multiple scenarios. We derive α from F (FLOPs), U (upstream perplexity), D (downstream accuracy) and P (number of parameters). In general, we find that the vanilla Transformer has the highest values of α, and models such as Evolved Transformer, GLU-Transformer, MoS-Transformer and Funnel Transformer tend to have similar scaling properties to the vanilla Transformer. The GLU-Transformer has similar or slightly worse scaling properties than the vanilla Transformer, even though it was observed to do better in an absolute sense in some compute regions. On the other hand, we also observe that there are models which are difficult to scale, such as LConv, UT, MLP-Mixer and Performer; this is even more evident on downstream tasks. We also note that ALBERT trends negatively (gets worse) as we scale the model up; this version of ALBERT shares parameters across the encoder and decoder, which may partially explain why we had a hard time scaling it up. The metric α_{U,D} measures how downstream performance scales with upstream performance. Overall, the Switch Transformer does best on this metric, with downstream performance scaling well with upstream performance. Generally, models that make fewer changes to the main Transformer architecture (GLU-Transformer, MoS-Transformer) tend to retain similar scaling behaviours, while changing the inductive bias significantly alters the scaling properties of the model.
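A plausible way to compute slopes like those in Table 2 is a least-squares line fit in a scale space such as log-FLOPs vs. performance. The snippet below is a sketch with invented numbers, since the exact fitting procedure and data points are not reproduced in this section.

```python
import numpy as np

def scaling_slope(x, y):
    """Slope alpha of a least-squares line fit of y against log10(x).

    A sketch of how values like alpha_{F,U} could be computed, with
    x = FLOPs and y = upstream performance (negative log-perplexity).
    The paper's exact fitting procedure may differ.
    """
    slope, _ = np.polyfit(np.log10(np.asarray(x, dtype=float)),
                          np.asarray(y, dtype=float), 1)
    return float(slope)

# Hypothetical numbers: model A improves faster per decade of compute
# than model B, so it gets the larger alpha_{F,U}.
flops = [1e12, 1e13, 1e14, 1e15]
perf_a = [-2.6, -2.2, -1.8, -1.4]   # gains 0.4 per decade of FLOPs
perf_b = [-2.5, -2.3, -2.1, -1.9]   # gains 0.2 per decade
print(round(scaling_slope(flops, perf_a), 3))  # 0.4
print(round(scaling_slope(flops, perf_b), 3))  # 0.2
```

Under this reading, a larger α simply means more performance gained per order of magnitude of the scaled quantity, which is why a single α per model summarizes behaviour across all scales.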

Do Scaling Protocols influence model architectures in the same way?
We are interested in how different scaling protocols influence model architectures. Figure 4 shows the effect of scaling depth on four model architectures (MoS-Transformer, Transformer, Evolved Transformer and LConv). Figure 5 shows the effect of scaling width on the same four architectures. Firstly, on the upstream (negative log-perplexity) curves, we note that while different architectures show distinct differences in absolute performance, the scaling trends remain quite similar. On downstream tasks, depth scaling (Figure 4) seems to act equally on most architectures with the exception of LConv, while Evolved Transformer seems to scale slightly better under width scaling (Figure 5). It is also interesting to note that depth scaling has a much more substantial impact on downstream scaling than width scaling.

Epilogue and Conclusion
In this paper, we conducted extensive experiments, pretraining and finetuning up to 100 models spanning 10 well-established Transformer and non-Transformer architectures. We showed that different model architectures can have different scaling behaviours, and that models performing well in one compute region (or model size) may not perform as well in another. We also showed that model architectures may do well on upstream perplexity but fail to transfer to downstream tasks. Hence, practitioners should take care to develop architectures that scale well not only with respect to upstream perplexity but also with respect to downstream performance. While we certainly do not expect researchers to always report model performance across all scales (especially large scale), we believe it is good to keep in mind that architectures can perform quite differently at different compute regions. Hence, this might be a good dimension to consider when designing new inductive biases, and performing evaluation at a single compute region may be insufficient to capture the full picture. It is also worth considering whether different inductive biases will result in different extents of emergent capabilities (Wei et al., 2022a; Abnar et al., 2020).
We also showed that different model architectures may react differently to different scaling protocols, reaffirming that comparing and benchmarking these models can be very challenging (Dehghani et al., 2021b). When it comes to scaling large models, we show that novel inductive biases can indeed be quite risky, which might explain why most state-of-the-art large language models (Rae et al., 2021; Chowdhery et al., 2022; Tay et al., 2022) are based on relatively vanilla architectures. Our advice is to be cautious when staking an expensive run on an architecture that drastically modifies the attention mechanism. Finally, we acknowledge that not every practitioner or researcher requires models that scale to billions of parameters; in such cases, inductive biases tailored to small or low-compute regions may be sufficient.

Limitations
As with all empirical studies, ours comes with its own set of limitations. We present only a sampling of Transformer variants, and it is not exhaustive. Our selection aims to sample a diversity of approaches so as to have representation across the space of Transformer architectures. As such, we do not claim that our findings hold within any one subcategory, for example efficient Transformer variants, of which there are many recent works not covered here. Additionally, given the huge number of models considered in this work, while we scaled each model to the best of our ability and present details on how they were scaled, there could always be unexplored hyperparameter settings and other tricks that could get a model to "work" at larger scales. Beyond this work, one could also study differences in prompting techniques, e.g., chain-of-thought prompting (Wei et al., 2022b), between different architectures and scales; such findings would be of importance to the research community. In either case, we believe that our central finding, i.e., that models scale differently and need to be tested accordingly, will continue to be relevant. This space will only continue to grow, and future researchers and practitioners must continue to assess the scalability of new models under new use cases.

Scaling Details for Individual Models
For most models, it was reasonable to follow the uniform scaling method across the main T5 sizes. At each size, the hyperparameters are as follows:

Figure 1: An overview compute-performance (FLOPs vs. performance) plot of all the diverse models and architectures we pretrained and finetuned in this study. Colors represent different model architectures and the sizes of the circles represent the sizes of the models (parameters).

Figure 2: Upstream Negative Log-Perplexity of vanilla Transformer compared to other models.

Figure 3: Downstream accuracy of vanilla Transformer compared to other models.

Figure 8: Quality-Throughput trade-off for the upstream Negative Log-Perplexity of vanilla Transformer compared to other models.

Table 1: Results on pre-training and finetuning ten different model architectures. Full results (further varying the hyperparameters of these models) can be found in the Appendix.

Table 2: Slope of a fitted linear line for each model, when we compare FLOPs vs. upstream performance (F, U), FLOPs vs. downstream performance (F, D), parameter size vs. upstream performance (P, U), parameter size vs. downstream performance (P, D), and finally upstream vs. downstream performance (U, D).
Each α value measures how well a model scales; for example, α_{F,U} is obtained by plotting FLOPs against upstream performance. The only exception is α_{U,D}, which measures upstream vs. downstream performance: a high α_{U,D} value means that transfer to the downstream tasks is better as a model scales. Overall, the α value is a metric that represents how well a model performs relatively across all scales.

Table 3: Table of model configurations. N_L is the number of layers, d_ff is the size of the MLP, d_model is the hidden size of the model, d_kv is the size of each key-value vector, and N_H is the number of heads.

Table 4: Scaling for Switch Transformer. N_E is the number of experts.

Table 5: Table of model configurations. N_R is the number of recurrent operations, d_ff is the size of the MLP, d_model is the hidden size of the model, d_kv is the size of each key-value vector, and N_H is the number of heads.

Quality-Parameter trade-off for the upstream Negative Log-Perplexity of vanilla Transformer compared to other models.