Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ2 norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family, one that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.


Introduction
Transformer-based models (Vaswani et al., 2017) like BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2019) have pushed the state of the art on an impressive array of NLP tasks. Overparameterized transformers are known to be universal approximators (Yun et al., 2020), suggesting their generalization performance ought to rely on useful biases or constraints imposed by the learning algorithm. Despite various attempts to study these biases in transformers (Rogers et al., 2020; Lovering et al., 2021), it remains an interesting open question what they are, or even how to characterize them in a way relevant to the domain of language.
In this work, we take the perspective that thoroughly understanding the dynamics of gradient descent (GD) might clarify the linguistic biases of transformers, and the types of representations they acquire. We start by making a potentially surprising empirical observation (§3): the parameter ℓ2 norm grows proportional to √t (where t is the timestep) during the training of T5 (Raffel et al., 2019) and other transformers. We refer to the phenomenon of growing parameter norm during training as norm growth. Previous work has analyzed norm growth in simplified classes of feedforward networks (Li and Arora, 2019; Ji and Telgarsky, 2020), but, to our knowledge, it has not been thoroughly demonstrated or studied in the more complicated and practical setting of transformers.
Our main contribution is analyzing the effect of norm growth on the representations within the transformer ( §4), which control the network's grammatical generalization. With some light assumptions, we prove that any network where the parameter norm diverges during training approaches a saturated network (Merrill et al., 2020): a restricted network variant whose discretized representations are understandable in terms of formal languages and automata. Empirically, we find that internal representations of pretrained transformers approximate their saturated counterparts, but for randomly initialized transformers, they do not. This suggests that the norm growth implicit in training guides transformers to approximate saturated networks, justifying studying the latter (Merrill, 2019) as a way to analyze the linguistic biases of NLP architectures and the structure of their representations.
Past work (Merrill, 2019;Bhattamishra et al., 2020) reveals that saturation permits two useful types of attention heads within a transformer: one that locally targets a small number of positions, and one that attends uniformly over the full sequence, enabling an "average" operation. Empirically, we find that both of these head types emerge in trained transformer language models. These capabilities reveal how the transformer can process various formal languages, and could also suggest how it might represent the structure of natural language. Combined, our theoretical and empirical results shed light on the linguistic inductive biases imbued in the transformer architecture by GD, and could serve as a tool to analyze transformers, visualize them, and improve their performance.
Finally, we discuss potential causes of norm growth in §5. We prove transformers are approximately homogeneous (Ji and Telgarsky, 2020), a property that has been extensively studied in deep learning theory. With some simplifying assumptions, we then show how homogeneity might explain the √t growth observed for T5.¹

Background and Related Work

GD and Deep Learning Theory
A simple case where deep learning theory has studied the generalization properties of GD is matrix factorization (Gunasekar et al., 2017; Arora et al., 2019; Razin and Cohen, 2020). It has been observed that deep matrix factorization leads to low-rank matrix solutions. Razin and Cohen (2020) argued theoretically that this bias of GD cannot be explained as an implicit regularizer minimizing some norm. Rather, they construct cases where all parameter norms diverge during GD. Similar ideas have emerged in recent works studying feedforward networks. Analyzing biasless ReLU networks with cross-entropy loss, Poggio et al. (2019, 2020) show that the magnitude (ℓ2 norm) of the parameter vector continues to grow during GD, while its direction converges. Li and Arora (2019) present a similar argument for scale-invariant networks, meaning that scaling the parameters by a constant does not change the output. Studying homogeneous networks, Ji and Telgarsky (2020) show that the gradients become aligned as t → ∞, meaning that their direction converges to the parameter direction. This means the norm will grow monotonically with t. The perspective developed by these works challenges the once conventional wisdom that the parameters converge to a finite local minimum during GD training. Rather, it suggests that GD follows a norm-increasing trajectory along which network behavior stabilizes. These analyses motivate investigation of this trajectory-driven perspective of training.

1 Code available at https://github.com/viking-sudo-rm/norm-growth.
From a statistical perspective, work in this vein has considered the implications of these training dynamics for margin maximization (Poggio et al., 2019;Nacson et al., 2019;Lyu and Li, 2019). While these works vary in the networks they consider and their assumptions, they reach similar conclusions: GD follows trajectories diverging in the direction of a max-margin solution. As margin maximization produces a simple decision boundary, this property suggests better generalization than an arbitrary solution with low training loss. This point of view partially explains why growing norm is associated with better generalization performance.

NLP and Formal Language Theory
Norm growth has another interpretation for NLP models. Past work characterizes the capacity of infinite-norm networks in terms of formal languages and automata theory. Merrill (2019) and Merrill et al. (2020) propose saturation, a framework for theoretical analysis of the capacity of NLP architectures. A network is analyzed by assuming it saturates its nonlinearities, which means replacing functions like σ and tanh with step functions. This is equivalent to the following definition:

Definition 1 (Saturation; Merrill et al., 2020) Let f(x; θ) be a neural network with inputs x and weights θ. The saturated network is

sf(x; θ) = lim_{c→∞} f(x; cθ)

where the limit exists, and undefined elsewhere.
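As a concrete illustration (a minimal sketch of ours, not from the paper's codebase), saturation can be approximated numerically by scaling a network's weights by a large constant, at which point a sigmoid unit behaves like a step function:

```python
import numpy as np

def sigmoid(z):
    # Numerically stable sigmoid (avoids overflow for large |z|).
    z = np.asarray(z, dtype=float)
    pos = 1.0 / (1.0 + np.exp(-np.abs(z)))
    return np.where(z >= 0, pos, 1.0 - pos)

def f(x, theta):
    # Toy one-neuron "network" sigmoid(w*x + b), with theta = (w, b).
    w, b = theta
    return sigmoid(w * x + b)

def saturate(f, x, theta, c=1e4):
    # Approximates sf(x; theta) = lim_{c -> inf} f(x; c * theta)
    # by scaling every weight by a large constant c.
    return f(x, tuple(c * p for p in theta))

theta = (2.0, -1.0)
# Preactivation 2*1 - 1 = 1 > 0, so the saturated sigmoid outputs 1;
# preactivation 2*0 - 1 = -1 < 0, so it outputs 0.
print(saturate(f, 1.0, theta), saturate(f, 0.0, theta))
```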
Saturation reduces continuous neural networks to discrete computational models resembling automata or circuits, making some kinds of formal linguistic analysis easier. For many common architectures, the saturated capacity is known to be significantly weaker than the full capacity of the network with rational-valued weights (Merrill, 2019), which, classically, is Turing-complete for even simple RNNs (Siegelmann and Sontag, 1992).
For example, one can hand-construct an RNN or LSTM encoding a stack in its recurrent memory (Kirov and Frank, 2012). Stacks are useful for processing compositional structure in linguistic data (Chomsky, 1956), e.g., for semantic parsing. However, a saturated LSTM does not have enough memory to simulate a stack (Merrill, 2019). Rather, saturated LSTMs resemble classical counter machines (Merrill, 2019): automata limited in their ability to model hierarchical structure (Merrill, 2020). Experiments suggest that LSTMs trained on synthetic tasks learn to implement counter memory (Weiss et al., 2018;Suzgun et al., 2019a), and that they fail on tasks requiring stacks and other deeper models of structure (Suzgun et al., 2019b). Similarly, Shibata et al. (2020) found that LSTM language models trained on natural language data acquire saturated representations approximating counters.
Recent work extends saturation analysis to transformers (Merrill, 2019; Merrill et al., 2020). Saturated attention heads reduce to generalized hard attention, where the attention scores can tie. In the case of ties, the head output averages the positions with maximal scores. While their power is not fully understood, saturated transformers can implement a counting mechanism similarly to LSTMs (Merrill et al., 2020). In practice, Bhattamishra et al. (2020) show transformers can learn tasks requiring counting, and that they struggle when more complicated structural representations are required. Ebrahimi et al. (2020) find that attention patterns of certain heads can emulate bounded stacks, but that this ability falls off sharply for longer sequences. Thus, the abilities of trained LSTMs and transformers appear to be predicted by the classes of problems solvable by their saturated counterparts. Merrill et al. (2020) conjecture that the saturated capacity might represent a class of tasks implicitly learnable by GD, but it is unclear a priori why this should be the case. This work aims to put this conjecture on more solid theoretical footing: we argue that approximate saturation arises in transformers as a result of norm growth during training.

Norm Growth in Transformers
We start with the observation that the parameter norm grows during transformer training. We first examine T5-base (Raffel et al., 2019), a 220M-parameter model, which was trained using the AdaFactor optimizer (Shazeer and Stern, 2018). Further details are in §A. Fig. 1 shows that the T5 norm follows a √t trend, where t is time in training steps. The top right of Fig. 1 breaks down the growth trend by layer. Generally, the norm grows more quickly in later layers than in earlier ones, although always at a rate proportional to √t.⁵ Next, in the bottom row of Fig. 1, we plot the cosine similarity between each parameter checkpoint θ_{t+1} and its predecessor θ_t. This rapidly approaches 1, suggesting the "direction" of the parameters (θ_t/‖θ_t‖) converges. The trend in directional convergence looks similar across layers.
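The two diagnostics used here (the parameter norm and the cosine similarity of consecutive checkpoints) are straightforward to compute from saved parameter vectors. The following sketch is our own illustration rather than the paper's code; it applies both diagnostics to synthetic checkpoints constructed to grow as √t in a fixed direction:

```python
import numpy as np

def param_norm(theta):
    return float(np.linalg.norm(theta))

def direction_similarity(prev, nxt):
    # Cosine similarity between consecutive checkpoints; values near 1
    # indicate the direction theta_t / ||theta_t|| is converging.
    return float(prev @ nxt / (np.linalg.norm(prev) * np.linalg.norm(nxt)))

# Synthetic checkpoints: sqrt(t) growth along a fixed direction plus noise
# (stand-ins for real parameter vectors loaded from training checkpoints).
rng = np.random.default_rng(0)
direction = rng.normal(size=1000)
direction /= np.linalg.norm(direction)
checkpoints = [np.sqrt(t) * direction + 0.01 * rng.normal(size=1000)
               for t in range(1, 101)]

norms = [param_norm(th) for th in checkpoints]
sims = [direction_similarity(a, b)
        for a, b in zip(checkpoints, checkpoints[1:])]
```

Plotting `norms` against √t and `sims` against t reproduces the qualitative trends described above: norm growth proportional to √t, and directional similarity approaching 1.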
We also train smaller transformer language models with 38M parameters on Wikitext-2 (Merity et al., 2016) and the Penn Treebank (PTB; Marcus et al., 1993). We consider two variants of the transformer: pre-norm and post-norm, which vary in the relative order of layer normalization and residual connections (cf. Xiong et al., 2020). Every model exhibits norm growth over training. 6 Combined, these results provide evidence that the parameter norm of transformers tends to grow over the course of training. In the remainder of this paper, we will discuss the implications of this phenomenon for the linguistic biases of transformers, and then discuss potential causes of the trend rooted in the optimization dynamics.
Effect of Norm Growth

§3 empirically documented that the parameter norm grows proportional to √t during T5 pretraining. Now, we move to the main contribution of our paper: the implications of norm growth for understanding transformers' linguistic inductive biases. In particular, Prop. 1 says uniform norm growth across the network guides GD towards saturated networks. Thus, saturation is not just a useful approximation for analyzing networks, but a state induced by training with enough time.

Proposition 1 (Informal) Let θ_t ∈ R^n be parameters at step t for f(x; θ_t). If every scalar parameter θ_t^i diverges at the same rate up to a constant, then f converges pointwise to a saturated network.

The proof is in §B. Prop. 1 assumes not just norm growth, but uniform norm growth, meaning no parameter can asymptotically dominate any other. Notably, uniform growth implies directional convergence. Accepting uniform growth for a given training regimen, we expect transformers to converge to saturated networks with infinite training. Based on §3, the T5 norm appears to grow ∝ √t uniformly across the network, suggesting the uniform growth condition is reasonable. As we will discuss later in §5, we expect the growth trend to depend heavily on the learning rate schedule.

5 We encourage future work that pretrains new transformer language models to track metrics around norm growth.

6 The post-norm transformer achieves 115.79 perplexity on Wikitext-2 and 96.24 on PTB. On the other hand, the pre-norm transformer reaches 66.35 on Wikitext-2 and 26.16 on PTB, slightly outperforming Wang et al. (2019). This is consistent with previous findings (Xiong et al., 2020) showing advantages of pre-norm over post-norm.

Saturated Transformers
Having established that norm growth should lead to saturation, we now empirically measure the saturation levels in T5 and other transformer models.
Large transformers are highly saturated. Since ‖θ_t‖ empirically grows during training, we expect high cosine similarity between the representations in trained networks and saturated representations. We estimate this as the cosine similarity between f(x; θ) and f(x; cθ) for some large c (in practice, 1,000). We consider the "base" versions of pretrained BERT, RoBERTa, T5, and XLNet (pretrained on masked language modeling), and compute the mean saturation over 100 input sentences from the Brown corpus (Francis and Kučera, 1989). To match standard practice, each sentence is truncated at 512 word pieces. Fig. 2 plots the similarity for each layer of each model. We compare the pretrained transformers against a randomly initialized baseline. For every model type, the similarity is higher for the pretrained network than for the randomly initialized network, which, except for T5, is ∼0. For T5 and XLNet, the similarity in the final layer is ≥0.9, whereas, for RoBERTa, the final similarity is 0.65 (although 0.94 in the penultimate layer). For T5 and XLNet, similarity is higher in later layers, which is potentially surprising, as one might expect error to compound with more layers. This may relate to the fact that the norm grows faster for later layers in T5. One question is why the similarity for BERT is lower than for these models. As RoBERTa is architecturally similar to BERT besides longer training, we hypothesize that RoBERTa's higher similarity is due to longer pretraining.
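The saturation measure described above can be sketched as follows, using a toy tanh layer in place of a pretrained transformer (the function and weights are illustrative, not from the paper):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def saturation_level(f, x, theta, c=1000.0):
    # Cosine similarity between f(x; theta) and f(x; c*theta): the latter
    # approximates the saturated network for large c.
    return cosine(f(x, theta), f(x, c * theta))

def f(x, theta):
    # Toy biasless "network": tanh(W @ x); tanh saturates to the sign function.
    return np.tanh(theta.reshape(4, 4) @ x)

x = np.array([1.0, -2.0, 0.5, -0.5])
big = (5.0 * np.eye(4)).ravel()      # large weights: already near-saturated
small = (0.01 * np.eye(4)).ravel()   # small weights: far from saturation
print(saturation_level(f, x, big), saturation_level(f, x, small))
```

With large weights the preactivations are already deep in tanh's saturated regime, so scaling them further barely moves the representation, and the saturation level is near 1.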

Small transformers reach full saturation.
Each of the transformers trained on Wikitext-2 and PTB reached a saturation level of 1.00. It is unclear why these models saturate more fully than the pretrained ones, although it might be because they are smaller. 7 For our LMs, the feedforward width (512) is less than for T5-base, while the encoder depth and width are the same. Other possible explanations include differences in the initialization scheme, optimizer, and training objective (masked vs. next-word modeling). See §A for full hyperparameters.

Power of Saturated Attention
We have shown that transformer training increases the parameter norm (§3), creating a bias towards saturation (§4.1). Now, we discuss the computational capabilities of saturated transformers, and empirically investigate how they manifest in pretrained transformers. What computation can saturated transformers perform? We review theoretical background about saturated attention, largely developed by Merrill (2019). Let H (sequence length n by model dimension d) be the input representation to a self-attention layer. We assume a standard self-attention mechanism with key, query, and value matrices K, Q, V.⁸ Saturated attention resembles standard attention where softmax is constrained to a generalization of "argmax" (Merrill, 2019):

sattn(H; Q, K, V) = argmax(HQK⊤H⊤) HV.
7 Qualitatively, we observed that *-small transformers tended to be more saturated than the *-base models.

8 To simplify presentation, we omit bias terms.
We define this vectorized argmax(A) row-wise: for each row a of A, let M(a) = {j | a_j = max_k a_k}; then argmax(a)_j = 1/|M(a)| if j ∈ M(a), and 0 otherwise. Crucially, in the case of ties, argmax(A) returns a uniform distribution over all tied positions. Saturated attention can retrieve the "maximum" value in a sequence according to some similarity matrix. It is also capable of restricted counting (Merrill et al., 2020). Formalizing these observations, we identify two useful computational operations that are reducible to saturated self-attention: argmax and mean. Let h_i represent the input representation at each time step 1 ≤ i ≤ n.

1. Argmax: Set V = Id. Then the self-attention mechanism computes a function recovering the element of H that maximally resembles h_i according to a quadratic form M = KQ⊤. If there is a tie for similarity, a uniform average of the maximal entries in H is returned.
2. Mean: Parameterize the head to attend uniformly everywhere. Then the head computes a function taking a uniform average of values:

f(h_1, …, h_n) = (1/n) Σ_{j=1}^n V h_j.   (1)

These constructions demonstrate some useful computational abilities of saturated transformers. Due to the summation in (1), the mean operation (or near variants of it) can be used to implement counting, which allows recognizing languages like a^n b^n c^n (Merrill et al., 2020). Empirically, Bhattamishra et al. (2020) find trained networks can learn to recognize counter languages that rely on computing means, failing on more complicated languages like Dyck-2. Our findings partially justify why transformers can learn these languages: they lie within the capacity of saturated transformers.
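These two constructions can be sketched directly. The code below (our illustration, not the paper's) implements saturated attention with a tie-averaging "hardmax" and checks the mean head: zero query/key matrices make every score tie, so each position attends uniformly over the sequence:

```python
import numpy as np

def hardmax(scores):
    # Row-wise vectorized argmax: a uniform distribution over the positions
    # attaining the row maximum (ties are averaged).
    is_max = (scores == scores.max(axis=-1, keepdims=True)).astype(float)
    return is_max / is_max.sum(axis=-1, keepdims=True)

def saturated_attention(H, Q, K, V):
    # sattn(H; Q, K, V) = hardmax(H Q K^T H^T) H V
    return hardmax(H @ Q @ K.T @ H.T) @ H @ V

n, d = 4, 3
H = np.arange(n * d, dtype=float).reshape(n, d)

# Mean head: zero query/key matrices make all scores tie, so every position
# receives attention 1/n and the head computes (1/n) * sum_j V h_j.
Z = np.zeros((d, d))
mean_out = saturated_attention(H, Z, Z, np.eye(d))

# Argmax-like behavior: tied maximal scores split attention evenly.
print(hardmax(np.array([[0.0, 2.0, 2.0, 1.0]])))
```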

Learned Attention Patterns
Recall that the small language models trained in §4.1 reach 1.00 saturation. It follows that we can convert them to saturated transformers (by multiplying θ by a large constant c) without significantly shifting the representations in cosine space. We will evaluate whether the saturated attention heads manifest the argmax and mean constructions from §4.2. As discussed in §4.2, saturated attention can parameterize both argmax and mean heads. An argmax head should attend to a small number of positions. A mean head, on the other hand, attends uniformly over the full sequence. Are both patterns acquired in practice by our models? We plot the distribution of the number of positions attended to by each head in the saturated PTB models in Fig. 3. The distribution is bimodal, with one mode at 1, and the other around 41, representing the mean sequence length of an 83-length encoder with positional masking to prevent lookahead. The empirical mode around 1 corresponds to heads that are argmax-like. The mode around 41, on the other hand, corresponds to mean-like heads, since it implies uniform attention over the masked sequence. Thus, our analysis suggests that analogs of both types of attention heads theorized in §4.2 are acquired in transformers in practice. In the pre-norm transformer, which performs substantially better, there are also a small number of heads lying between the two modes. We defer investigation of the function of these heads to future work.
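A simple way to operationalize this analysis, sketched below under our own conventions, is to count, for each query position, how many positions receive (near-)maximal attention; argmax-like heads yield counts near 1 and mean-like heads yield counts near the sequence length:

```python
import numpy as np

def positions_attended(attn_row, tol=1e-6):
    # Number of positions effectively attended to at one query position:
    # entries within tol of the row maximum. After saturation, tied maximal
    # scores all receive equal attention weight.
    return int(np.sum(attn_row >= attn_row.max() - tol))

# An argmax-like head concentrates attention on one position, while a
# mean-like head ties across the whole (causally masked) prefix.
argmax_like = np.array([0.0, 0.0, 1.0, 0.0])
mean_like = np.full(4, 0.25)
print(positions_attended(argmax_like), positions_attended(mean_like))
```

Aggregating this count over all heads and query positions produces a histogram like the bimodal distribution described above.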

Explanation for Norm Growth
We have documented norm growth in T5 and other transformers ( §3) and showed how it induces partial saturation in their representations ( §4). This section points towards an understanding of why the parameter norm grows over the course of training, grounded in results about norm growth from deep learning theory. We do not analyze specific optimizers directly; instead, we analyze norm growth within simplified models of training dynamics taken from the literature. We then evaluate how these candidate dynamics models fit T5's training.

Setup
Let δ_t ∈ R^n denote the optimizer step at time t, i.e., δ_t = θ_{t+1} − θ_t. We write η_t for the learning rate at t. Let ∇_{θ_t} L denote the gradient of the loss with respect to θ_t. By GD, we refer to the update δ_t = −η_t ∇_{θ_t} L. In contrast, we will use the term gradient flow to refer to its continuous relaxation, specified by an analogous differential equation:

dθ_t/dt = −η_t ∇_{θ_t} L.

Homogeneity
We will rely on properties of homogeneous networks, a class of architectures well-studied in deep learning theory (Ji and Telgarsky, 2020).

Definition 2 (Homogeneity) A function f(θ) is k-homogeneous iff, for all c > 0, f(cθ) = c^k f(θ).

We further say that f is homogeneous iff there exists some k such that f is k-homogeneous.
Many common components of modern neural networks are homogeneous (Li and Arora, 2019). Furthermore, as various computations within a neural network preserve homogeneity ( §C), some full networks are also homogeneous. An example of a fully homogeneous neural network is a feedforward ReLU network without bias terms.
Why is homogeneity relevant for transformers? Transformers are not homogeneous, but they are almost homogeneous. We formalize this as:

Definition 3 (Approximate homogeneity) A function f(θ) is approximately k-homogeneous iff there exists d > 0 such that, for all c ≥ 1 and large enough ‖θ‖, f(cθ) = c^k f(θ) + ε, where, for each i, |ε_i| ≤ exp(−d‖θ‖).

In other words, as ‖θ‖ grows, f approximates a homogeneous function with exponentially vanishing error. In §D, we prove transformer encoders without biases are approximately 1-homogeneous. In Fig. 4, we compare the cosine similarity of transformers with and without biases to their saturated variants, as a function of a constant c scaling their weights. For an approximately homogeneous function, this similarity should rapidly approach 1.0 as c increases. We find similar curves for transformers with and without biases, suggesting biasless transformers are similarly homogeneous to transformers with biases.¹²

Since multiplying two homogeneous functions adds their homogeneity, a transformer encoder followed by a linear classifier is approximately 2-homogeneous. A key property of homogeneous functions is Euler's Homogeneity Theorem: the derivative of a k-homogeneous function is (k − 1)-homogeneous. Thus, we will assume the gradients of the linear classifier output are roughly 1-homogeneous, which under simple GD implies:

Assumption 1 Let θ_t include all encoder and classifier parameters. Let ∝~ mean "approximately proportional to". For large enough t during transformer training, ‖δ_t‖ ∝~ η_t ‖θ_t‖.
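Approximate 1-homogeneity can be checked numerically. In the sketch below (a toy biasless network of our own construction, not the paper's), the error ‖f(cθ) − c·f(θ)‖ shrinks dramatically once the preactivations are large:

```python
import numpy as np

def sigmoid(z):
    # Numerically stable sigmoid.
    z = np.asarray(z, dtype=float)
    pos = 1.0 / (1.0 + np.exp(-np.abs(z)))
    return np.where(z >= 0, pos, 1.0 - pos)

def f(x, theta):
    # Biasless toy network with a saturating activation: W2 @ sigmoid(W1 @ x).
    # Scaling theta by c scales W2 by c and pushes sigmoid toward saturation,
    # so f is approximately 1-homogeneous in theta.
    W1, W2 = theta
    return W2 @ sigmoid(W1 @ x)

def homogeneity_error(f, x, theta, c, k=1):
    # If f(c*theta) = c^k f(theta) + eps, return ||eps||.
    scaled = tuple(c * p for p in theta)
    return float(np.linalg.norm(f(x, scaled) - (c ** k) * f(x, theta)))

x = np.array([1.0, 0.5])
theta_small = (np.array([[4.0, -4.0]]), np.array([[1.0]]))   # preactivation 2
theta_large = tuple(10.0 * p for p in theta_small)           # preactivation 20
print(homogeneity_error(f, x, theta_small, c=3.0),
      homogeneity_error(f, x, theta_large, c=3.0))
```

With small weights the sigmoid is far from saturation and the homogeneity error is substantial; with the same weights scaled up, the error is negligible, matching the exponentially vanishing error in Def. 3.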

Aligned Dynamics
We now consider the first candidate dynamics model: aligned dynamics (Ji and Telgarsky, 2020). Analyzing homogeneous networks with an exponential binary classification loss and gradient flow, Ji and Telgarsky (2020) show that the parameters converge in direction, and that the gradients become aligned, meaning that θ_t · δ_t → ‖θ_t‖‖δ_t‖. While it is unclear whether transformers will follow aligned dynamics, we entertain this as one hypothesis. Under Ass. 1, alignment implies

‖θ_{t+1}‖ ≈ ‖θ_t‖ + ‖δ_t‖ ∝~ ‖θ_t‖(1 + η_t).

With the η_t = 1/√t schedule used by T5 (Raffel et al., 2019), this yields ‖θ_t‖ ∝~ exp(√t) (see §E.1). This is asymptotically faster than the observed √t growth, suggesting an alternate dynamics might be at play.

Misaligned Dynamics
Our second candidate model of training is misaligned dynamics, which follows largely from Li and Arora (2019). This can be derived by assuming the gradients are misaligned (i.e., θ_t · δ_t = 0), which holds for scale-invariant networks (Li and Arora, 2019) and in expectation for random normal gradients. Misalignment implies (derived in §E.2):

‖θ_{t+1}‖² = ‖θ_t‖² + ‖δ_t‖².   (2)

We show in §E.2 that, with the T5 learning rate, (2) reduces to ‖θ_t‖ ∝~ √t, as observed empirically for T5. We now further test whether misaligned dynamics are a good fit for T5.

12 Lyu and Li (2019) find similar results for feedforward ReLU networks. It is an interesting puzzle why networks with biases appear similarly homogeneous to those without biases.
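The contrast between the two candidate dynamics can be simulated directly. Under Ass. 1 with the T5 schedule η_t = 1/√t, fully aligned steps multiply the norm by (1 + η_t) each step, while fully misaligned (orthogonal) steps multiply the squared norm by (1 + η_t²). A minimal sketch under these assumptions:

```python
import numpy as np

T = 10_000
aligned = [1.0]     # ||theta_{t+1}|| = ||theta_t|| * (1 + eta_t): delta parallel to theta
misaligned = [1.0]  # ||theta_{t+1}||^2 = ||theta_t||^2 * (1 + eta_t^2): delta orthogonal
for t in range(1, T):
    eta = 1.0 / np.sqrt(t)
    aligned.append(aligned[-1] * (1.0 + eta))
    misaligned.append(misaligned[-1] * np.sqrt(1.0 + eta ** 2))

# Misaligned dynamics reproduce the sqrt(t) trend observed for T5, while
# aligned dynamics grow roughly like exp(sqrt(t)), far faster than observed.
print(misaligned[-1], np.sqrt(T))
```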

Evaluation
We measure the gradient alignment over the course of training T5. Our alignment metric is the cosine similarity of δ_t to θ_t. As shown on the left of Fig. 5, the alignment initially rapidly increases to ∼0.15, and then decays to near 0. This supports the hypothesis that the T5 dynamics are misaligned, since the similarity is never high, and may be approaching 0.
On the right of Fig. 5, we plot step size over training in order to evaluate the validity of Ass. 1. At the beginning of training, a chaotic step size seems reasonable, as it is hard to predict the dynamics before approximate homogeneity takes hold. For large t, Ass. 1 combined with the T5 learning rate schedule predicts step size should be roughly constant.¹³ This is not exactly what we find: for large t, ‖δ_t‖ grows gradually with t. However, the absolute change in step size is small: <20 across 220M parameters. Thus, we believe Ass. 1 is not unreasonable, though it would be interesting to understand what properties of the optimizer can explain the slight growth in step size.¹⁴

Weight Decay
One feature of practical training schemes not considered in this section is weight decay. When applied to standard GD, weight decay can be written δ_t = −η_t ∇_{θ_t} L − λθ_t. Intuitively, it might hinder norm growth if λ is large.¹⁵ In §F, we report preliminary experiments testing the effect of weight decay on norm growth. Indeed, if λ is set too large, weight decay can prevent norm growth, but within the standard range of values for λ, we find norm growth even in the face of weight decay. However, it is possible these results may change if the optimizer or other hyperparameters are varied.

13 Since ‖δ_t‖ ∝~ η_t‖θ_t‖ = √t/√t = 1.

14 We believe the sharp drop in ‖δ_t‖ at the final step is an artifact of the original recording of these checkpoints.

Conclusion
We empirically found that ‖θ_t‖ grows ∝ √t during T5 pretraining, a fact that may be caused by the approximate homogeneity of the transformer architecture. We proved that norm growth induces saturation, and then showed empirically that T5 and other large transformers become approximately saturated through their pretraining. Examining highly saturated transformer language models, we found the attention heads largely split between two distinct behaviors that can be roughly interpreted as argmax and mean operations. While we lack a precise formal characterization of "semi-saturated" transformers, we conjecture their capacity resembles that of the saturated models. Thus, we believe further analyzing the capabilities of saturated attention may clarify the linguistic biases that emerge in transformers through training, and the mechanisms they use to represent linguistic structure.

15 Common wisdom says that weight decay improves generalization by keeping ‖θ_t‖ small; however, recent work challenges the assumption that a bias towards small norm is beneficial (Goldblum et al., 2020), suggesting the benefit of weight decay may arise from more subtle effects on the GD trajectory.

A Experimental Details
We provide experimental details for the small language models that we trained. The models were trained for 5 epochs, and the best-performing model was selected based on development loss. Reported metrics were then measured on the held-out test set. We used our own implementation of the standard pre- and post-norm transformer architectures. We did not do any hyperparameter search, instead choosing the following hyperparameters:

• Batch size of 16
• Model dimension of 768
• Feedforward hidden dimension of 512
• 12 heads per layer
• 12 layers
• AdamW optimizer with default PyTorch hyperparameters
• 0 probability of dropout
• Default PyTorch initialization

Tokenization For Wikitext-2, 3 tokens in the whole test dataset were unattested in the training set (due to capitalization). To make our model compatible with unseen tokens, we replaced these tokens with <unk>, the same class that appeared for low-frequency words at training time, when evaluating the final test perplexity. Due to the small number of tokens that were affected, the impact of this change should be negligible.
Compute We estimate the experiments in this paper took several hundred GPU hours on NVIDIA A100 GPUs over the course of almost two years of on-and-off research time.
T5 We used the historical checkpoints of bsl-0, one of five T5-base models that was trained for the original paper (Raffel et al., 2019).
Measuring Norms As a systematic choice, all measurements of parameter norm include only encoder parameters that are not scalars. We advise other researchers to follow the practice of excluding embedding parameters, as embedding parameters that are infrequently updated may obscure general trends in the network parameters.

Component | k Input | k Output
Linear    | k       | k + 1
Bias      | –       | 1

Table 1: Effects of network components on homogeneity, shown by Li and Arora (2019). We write the "k Output" homogeneity as a function of the "k Input" homogeneity. These facts can be applied recursively to compute the homogeneity of a network. We will show that the same facts hold for approximate homogeneity.

B Norm Growth and Saturation
Proposition 2 (Formal version of Prop. 1) Let θ_t ∈ R^n be the parameter vector at train step t for a network f(x; θ_t). Assume that there exists a scalar sequence c(t) → ∞ and a fixed vector θ ∈ (R \ {0})^n such that θ_t → θ · c(t) as t → ∞ (i.e., θ_t/c(t) → θ).
Then f converges pointwise to a saturated network in function space. Proof.

C Approximate Homogeneity
In this section, we will further develop the notion of approximate homogeneity. We will prove that approximate homogeneity is consistent: in other words, every function can have at most one degree k of approximate homogeneity. Next, we will show that the useful closure properties applying to full homogeneity also apply to approximate homogeneity. If f(θ) is approximately k-homogeneous (cf. Def. 3), then f(cθ) = c^k f(θ) + ε for some error vector ε where, for each i, |ε_i| ≤ exp(−d‖θ‖), for all c and large enough ‖θ‖. We use this notation throughout this section.

C.1 Consistency
We first prove that approximate homogeneity is consistent: in other words, if a function is both approximately k 1 and k 2 -homogeneous, then k 1 = k 2 . This is an important property for establishing approximate homogeneity as a meaningful notion.
Lemma 1 Let k_1, k_2 ∈ N. Assume that f is both approximately k_1- and k_2-homogeneous. Then k_1 = k_2.

Proof. If f is both approximately k_1- and k_2-homogeneous, then we have vanishing error terms ε_1 and ε_2 such that, for all c,

f(cθ) = c^{k_1} f(θ) + ε_1,
f(cθ) = c^{k_2} f(θ) + ε_2.

Subtracting one equation from the other yields

(c^{k_1} − c^{k_2}) f(θ) = ε_2 − ε_1.

The right-hand side vanishes exponentially in ‖θ‖ for all c, whereas the left-hand side grows with c unless k_1 = k_2. Thus, to satisfy this equation for all c, it must be the case that k_1 = k_2.

C.2 Closure Properties
We now prove that effects of various functions on homogeneity explored by Li and Arora (2019) also translate to approximate homogeneity.
Lemma 2 ReLU preserves approximate k-homogeneity, i.e., let f : R^n → R be approximately k-homogeneous. Then ReLU ∘ f is approximately k-homogeneous.

Proof. Since c^k > 0, ReLU(c^k f(θ)) = c^k ReLU(f(θ)). Since ReLU is 1-Lipschitz, ReLU(f(cθ)) = ReLU(c^k f(θ) + ε) = c^k ReLU(f(θ)) + ε′ with |ε′_i| ≤ |ε_i| ≤ exp(−d‖θ‖). Therefore, ReLU ∘ f is approximately k-homogeneous.

Lemma 3 Let f, g be vector-valued functions of θ. If f and g are approximately k-homogeneous, then f + g is approximately k-homogeneous.

Proof. (f + g)(cθ) = c^k f(θ) + ε_f + c^k g(θ) + ε_g = c^k (f + g)(θ) + (ε_f + ε_g), and each |ε_{f,i} + ε_{g,i}| ≤ 2 exp(−d‖θ‖), which still vanishes exponentially for large enough ‖θ‖.
Lemma 4 Let f, g be vector-valued functions of θ. If f and g are approximately k_f- and k_g-homogeneous, then f · g is approximately (k_f + k_g)-homogeneous.

Proof. Expanding the product,

f(cθ) · g(cθ) = (c^{k_f} f(θ) + ε_f)(c^{k_g} g(θ) + ε_g)
             = c^{k_f + k_g} f(θ)g(θ) + c^{k_f} f(θ)ε_g + c^{k_g} g(θ)ε_f + ε_f ε_g.

We can rewrite the term c^{k_f} f(θ)ε_g as (f(cθ) − ε_f)ε_g, and symmetrically for c^{k_g} g(θ)ε_f. Each resulting term contains a factor ε_f or ε_g whose entries are bounded by exp(−d‖θ‖), so all the error terms vanish exponentially for large enough ‖θ‖. The analogous results for linear transformation, bias, and affine transformation directly follow from the results for sum and product in Lem. 3 and Lem. 4.
Finally, we show that layer norm converts an approximately homogeneous function to an approximately scale-invariant function. In order to be numerically stable, practical implementations of layer norm utilize a small tolerance term so that the denominator is never zero. We omit this practical detail from our analysis, instead defining the layer norm LN(x) for x ∈ R^n according to

µ(x) = (1/n) Σ_{i=1}^n x_i,
LN(x) = (x − µ(x)) / ‖x − µ(x)‖.

Lemma 5 Let f be approximately k-homogeneous for some k. Then LN(f) is approximately 0-homogeneous.
Proof. Since addition preserves approximate k-homogeneity, the mean (and the difference from the mean) preserve approximate k-homogeneity. Letting C = c^k, and writing f = f(θ) and μ = μ(f(θ)) to simplify notation, we can write

  f(cθ) − μ(f(cθ)) = C(f − μ) + ε ,

where ε vanishes exponentially in ‖θ‖. We now apply this to the definition of layer norm to get

  LN(f(cθ)) = (C(f − μ) + ε) / ‖C(f − μ) + ε‖ .

We show that the difference between this and the unscaled layer norm LN(f(θ)) = (f − μ)/‖f − μ‖ goes to zero:

  ‖LN(f(cθ)) − LN(f(θ))‖ ≤ ‖ε‖ ‖v‖

for some v ∈ R^n which does not grow with ‖θ‖. Thus, setting the error term to this final quantity satisfies the definition of approximate 0-homogeneity, i.e. approximate scale invariance. ∎
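Lem. 5 is easy to verify numerically. The following is our own sketch (not the paper's code), using a strictly 2-homogeneous toy function of θ: layer norm, without the stability epsilon, makes its output invariant to scaling θ.

```python
import numpy as np

# Check that LN maps a homogeneous function of theta to a scale-invariant
# one: LN(f(c * theta)) == LN(f(theta)). LN here omits the stability epsilon,
# as in the analysis.
def layer_norm(x):
    centered = x - x.mean()
    return centered / np.linalg.norm(centered)

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
W = rng.normal(size=(4, 5))

def f(theta):
    # (W @ theta) * ||theta||: a strictly 2-homogeneous function of theta
    return (W @ theta) * np.linalg.norm(theta)

diffs = []
for c in (1.0, 3.0, 10.0):
    diffs.append(np.max(np.abs(layer_norm(f(c * theta)) - layer_norm(f(theta)))))

assert max(diffs) < 1e-10   # invariant up to floating-point error
```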

C.3 Saturating Activation Functions
We show that the exponentially saturating activation functions σ, softmax, and tanh are approximately scale-invariant in x, i.e., scaling x has an exponentially diminishing effect on the output. We start by analyzing the simpler sigmoid, and then show that the same result holds for softmax. For completeness, we then present a proof for tanh. We use Θ (not θ) in the standard sense of asymptotic notation.
Lemma 6 The scaling error for σ vanishes exponentially in the preactivation magnitude, i.e., for all c ≥ 1 and x ∈ R,

  |σ(cx) − σ(x)| ≤ exp(−Θ(|x|)) .

Proof. Assume without loss of generality that x ≠ 0, as if x = 0, the error is 0. When x > 0, σ is increasing and bounded above by 1, so we have

  |σ(cx) − σ(x)| ≤ 1 − σ(x) = e^{−x} / (1 + e^{−x}) ≤ e^{−x} = exp(−Θ(|x|)) .

When x < 0, analogously, |σ(cx) − σ(x)| ≤ σ(x) ≤ e^{x} = exp(−Θ(|x|)). ∎

Lemma 7 The elementwise scaling error for softmax vanishes exponentially in the preactivation norm, i.e., for all c ≥ 1, x ∈ R^n, and 1 ≤ i ≤ n,

  |softmax(cx)_i − softmax(x)_i| ≤ exp(−Θ(‖x‖)) .

Proof. The proof closely follows that of Lem. 6, but is more involved. We consider two cases: x_i = max(x), and x_i ≠ max(x).

Case 1: x_i = max(x). We can write

  softmax(x)_i = e^{x_i} / Σ_j e^{x_j} = 1 / (1 + Σ_{j≠i} e^{x_j − x_i}) = σ(x_i − d)

for some d ∈ R. As this has the form of σ, the argument of Lem. 6 bounds the scaling error.

Case 2: x_i ≠ max(x). An analogous rewriting expresses softmax(x)_i in terms of σ of a preactivation whose magnitude grows with ‖x‖, which is identical to case 1. ∎
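Both bounds are easy to probe empirically. The sketch below (ours, not the paper's code) checks the sigmoid bound |σ(cx) − σ(x)| ≤ exp(−|x|) directly, and checks that the softmax scaling error likewise shrinks as the preactivation magnitudes grow (using well-separated coordinates).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Lem. 6: |sigmoid(c*x) - sigmoid(x)| <= exp(-|x|) for all c >= 1.
xs = np.array([-8.0, -4.0, -1.0, 1.0, 4.0, 8.0])
ratios = []
for c in [1.0, 2.0, 5.0, 20.0]:
    err = np.abs(sigmoid(c * xs) - sigmoid(xs))
    ratios.append(float(np.max(err / np.exp(-np.abs(xs)))))
assert max(ratios) <= 1.0          # the exponential bound holds

# Lem. 7: softmax scaling error shrinks as the preactivations grow.
softmax_errs = []
for s in [1.0, 2.0, 4.0, 8.0]:
    x = s * np.array([4.0, 1.0, -2.0])
    softmax_errs.append(float(np.max(np.abs(softmax(2.0 * x) - softmax(x)))))
assert all(a > b for a, b in zip(softmax_errs, softmax_errs[1:]))
```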
Finally, for completeness, we show that tanh exhibits the same property. The proof is very similar to that for sigmoid, following closely from the definition

  tanh(x) = (exp(2x) − 1) / (exp(2x) + 1) .

Lemma 8 The scaling error for tanh vanishes exponentially in the preactivation magnitude, i.e., for all c ≥ 1 and x ∈ R,

  |tanh(cx) − tanh(x)| ≤ exp(−Θ(|x|)) .

Proof. Since tanh(x) = 2σ(2x) − 1, we have, by Lem. 6,

  |tanh(cx) − tanh(x)| = 2 |σ(2cx) − σ(2x)| ≤ 2 exp(−2|x|) = exp(−Θ(|x|)) . ∎
Thus, applying these functions to a homogeneous input produces an output that is approximately scale-invariant in the parameters θ; in this sense, they act similarly to layer norm, which also maps homogeneous input to scale-invariant output. But what happens if the input is only approximately homogeneous, rather than strictly homogeneous? In this case, we show that the output is still approximately scale-invariant, assuming ‖θ‖ is sufficiently large.
Proposition 3 Let f(x; θ) be approximately k-homogeneous in θ. Then σ(f), softmax(f), and tanh(f) are approximately scale-invariant in θ.

Proof. By approximate homogeneity, f(x; cθ) = c^k f(x; θ) + ε, with ε vanishing exponentially in ‖θ‖. Crucially, since ε vanishes for large norm, there is some ρ where, for all θ such that ρ < ‖θ‖,

  arg max_i [c^k f(x; θ) + ε]_i = arg max_i [c^k f(x; θ)]_i .

Therefore, for θ such that ‖θ‖ > ρ, the bounds used in Lem. 6, Lem. 7, and Lem. 8 hold for approximately homogeneous f as well. Thus, we can conclude that the output is approximately scale-invariant. ∎
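The consequence of Prop. 3 for attention can be seen in a few lines (our own sketch, not the paper's code): as approximately homogeneous attention logits are scaled up, the softmax argmax is preserved and the distribution approaches a one-hot vector, i.e. the attention saturates.

```python
import numpy as np

# As the scaling factor grows, the softmax argmax is preserved and the
# distribution approaches a hard argmax (saturated attention).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, -1.0])   # stand-in for ~homogeneous attention scores
probs = [softmax(c * logits) for c in (1.0, 4.0, 16.0)]

assert all(int(p.argmax()) == 0 for p in probs)   # argmax preserved
assert probs[-1][0] > 0.999                       # saturating toward one-hot
```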

D Transformers
We introduce the notation ∼k-homogeneous to mean approximately k-homogeneous. In this section, we show that the transformer encoder is ∼1-homogeneous. A transformer (Vaswani et al., 2017) is made up of three main components: an embedding layer, self-attention sublayers, and feedforward sublayers. Since the embedding layer is just a matrix multiplication, it is a 1-homogeneous function of its parameters. Assuming the self-attention and feedforward sublayers have no bias terms, we show that they preserve approximate 1-homogeneity. As the full network is an initial embedding layer followed by these sublayers, the final output is ∼1-homogeneous. In the main paper, we discuss the connection between homogeneity and norm growth.
We base our analysis on the HuggingFace implementation 16 of BERT (Wolf et al., 2019). To aid analysis, we make some simplifying assumptions, which are discussed along with the definitions. We later show empirically that homogeneity for the unsimplified versions is similar.

D.1 Transformer Definition
The transformer encoder is a cascade of alternating multi-head self-attention sublayers and feedforward sublayers. Each multi-head self-attention sublayer can be further broken down as an aggregation of self-attention heads. Let LN(·) denote a layer norm followed by a learned affine transformation. Here we will consider the pre-norm transformer variant (Xiong et al., 2020), meaning that LN comes before the residual connection wherever it appears. 17 We will also assume that there are no biases, making all affine transformations into strict linear transformations.
Definition 4 (Self-attention head) Given parameters W^k, W^q, W^v and input X ∈ R^{T×n}, we define a self-attention head attn as

  K = XW^k ,  Q = XW^q ,  V = XW^v ,
  A = softmax(QK^⊤) ,  H = AV ,

where H is the output tensor.

16 https://huggingface.co/transformers/_modules/transformers/modeling_bert.html#BertModel
17 The post-norm transformer applies these operations in the opposite order.
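Def. 4's self-attention head can be sketched directly in numpy (our own illustration; like the definition, it omits the usual 1/√d temperature and all biases):

```python
import numpy as np

# Minimal biasless self-attention head, following Def. 4.
def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attn_head(X, Wk, Wq, Wv):
    K, Q, V = X @ Wk, X @ Wq, X @ Wv   # keys, queries, values
    A = softmax_rows(Q @ K.T)          # attention weights, rows sum to 1
    return A @ V                       # output H

rng = np.random.default_rng(0)
T, n = 3, 4                            # sequence length, feature dimension
X = rng.normal(size=(T, n))
Wk, Wq, Wv = (rng.normal(size=(n, n)) for _ in range(3))
H = attn_head(X, Wk, Wq, Wv)
assert H.shape == (T, n)
```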
The multi-head self-attention sublayer computes several attention heads in parallel and aggregates them into a single sequence of vectors.
Definition 5 (Multi-head self-attention sublayer) Let X ∈ R^{T×n} be the input. We now define the k-multi-head self-attention sublayer MSA_k. First, we compute k self-attention heads in parallel to produce H_1, …, H_k. We then concatenate these along the feature axis to form H, and compute the sublayer output Y as MSA_k(X) = LN(WH) + X.
Finally, the feedforward sublayer is the other core component of the transformer.
Definition 6 (Feedforward sublayer) Let X ∈ R^{T×n} be the input. We compute the feedforward sublayer FF according to FF(X) = LN(W^f ReLU(W^i X)) + X.

D.2 Results
Proposition 4 If X is ∼1-homogeneous in parameters θ, then attn(X; W^k, W^q, W^v) is ∼2-homogeneous in the concatenation of θ, W^k, W^q, W^v.

Proof. Consider a self-attention layer receiving a ∼1-homogeneous input matrix X ∈ R^{T×n}, where T is the sequence length. By the homogeneity rule for multiplication (Lem. 4), K, Q, and V are each ∼2-homogeneous, as homogeneity is additive over multiplication. By the same argument, QK^⊤ is ∼4-homogeneous. In Prop. 3, we showed that if the input to softmax is approximately homogeneous, then the output is approximately scale-invariant; thus, A is ∼0-homogeneous. Finally, since A is ∼0-homogeneous and V is ∼2-homogeneous, AV is ∼2-homogeneous. ∎
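The degree of the head output can be estimated numerically (our own sketch, not the paper's code). We pick identity weight matrices and orthonormal input rows so the softmax is already saturated at the base scale; the estimated degree log_c(‖H(cθ)‖ / ‖H(θ)‖) then comes out at 2, consistent with V being ∼2-homogeneous and A approximately scale-invariant.

```python
import numpy as np

# Estimate the homogeneity degree of the attention-head output by scaling
# the (~1-homogeneous) input and all weights together.
def softmax_rows(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attn_head(X, Wk, Wq, Wv):
    K, Q, V = X @ Wk, X @ Wq, X @ Wv
    return softmax_rows(Q @ K.T) @ V

X = np.eye(3, 4)   # orthonormal rows, so QK^T is diagonal and well separated
I = np.eye(4)

def head_at_scale(s):
    return attn_head(s * X, s * I, s * I, s * I)

base, c = 3.0, 2.0
ratio = np.linalg.norm(head_at_scale(c * base)) / np.linalg.norm(head_at_scale(base))
k_est = float(np.log(ratio) / np.log(c))
assert abs(k_est - 2.0) < 0.01   # measured degree ~2
```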
We show that the multi-head component that aggregates multiple heads into a shared representation also preserves approximate 1-homogeneity.
Proposition 5 If X is ∼1-homogeneous in parameters θ, then MSA(X) is ∼1-homogeneous in the full parameter set.

Proof. The concatenation H of the head outputs is approximately homogeneous (Prop. 4), so WH is as well. The layer norm inside LN(·) maps any approximately homogeneous input to a ∼0-homogeneous output (Lem. 5), and the learned affine transformation that follows it is 1-homogeneous in its own parameters; hence LN(WH) is ∼1-homogeneous. The input X is also ∼1-homogeneous by assumption, meaning that the sum MSA(X) = LN(WH) + X is ∼1-homogeneous. ∎
Finally, we turn to analyzing the feedforward sublayer of the transformer.
Proposition 6 If X is ∼1-homogeneous, then FF(X; W^f, W^i) is ∼1-homogeneous in the full parameters.

Proof. Multiplying by each weight matrix increases approximate homogeneity by 1, and ReLU preserves it, so the input to LN, W^f ReLU(W^i X), is ∼3-homogeneous. Thus, the output of LN (including its learned affine transformation) is ∼1-homogeneous, and adding X preserves approximate 1-homogeneity. ∎
Together, these results show that each pre-norm sublayer maps a ∼1-homogeneous input to a ∼1-homogeneous output, and the standard embedding layer supplies the ∼1-homogeneous "base case". By induction, the output of a biasless pre-norm transformer encoder of any depth is ∼1-homogeneous.
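For the feedforward sublayer, the 1-homogeneity is easy to check exactly, since no saturating function is involved (our own sketch, not the paper's code): scaling the input and all parameters by c scales the sublayer output by exactly c.

```python
import numpy as np

# Biasless feedforward sublayer per Def. 6. LN here is a layer norm followed
# by a learned elementwise scale g (the affine map, without bias); the
# stability epsilon is omitted, as in the analysis.
def layer_norm_affine(X, g):
    mu = X.mean(axis=-1, keepdims=True)
    centered = X - mu
    normed = centered / np.sqrt((centered ** 2).mean(axis=-1, keepdims=True))
    return normed * g

def ff_sublayer(X, Wi, Wf, g):
    return layer_norm_affine(np.maximum(0.0, X @ Wi) @ Wf, g) + X

rng = np.random.default_rng(0)
T, n, h = 3, 4, 8
X = rng.normal(size=(T, n))
Wi, Wf, g = rng.normal(size=(n, h)), rng.normal(size=(h, n)), rng.normal(size=n)

c = 5.0
Y1 = ff_sublayer(X, Wi, Wf, g)
Yc = ff_sublayer(c * X, c * Wi, c * Wf, c * g)
assert np.allclose(Yc, c * Y1)   # strict 1-homogeneity of FF
```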
Interestingly, the homogeneity arguments do not work out if we instead consider the post-norm transformer architecture (Xiong et al., 2020).

E Norm Growth, Learning Rate, and Weight Decay

We investigate the interaction of norm growth with the learning rate η and weight decay λ by training a variety of transformer language models on Wikitext-2 for 1 epoch. 18 We use the AdamW (Loshchilov and Hutter, 2017) optimizer, varying λ and η across a range of common values and keeping all other hyperparameters constant. Fig. 6 visualizes the phase transition for norm growth as a function of λ and η. The norm growth behavior seems to depend largely on weight decay, with a threshold for λ lying between 0.01 and 0.001. While the trend likely depends on the optimizer, we can infer, for AdamW at least, that norm growth is probable when λ = 0.01, which is a common choice, e.g., reflecting default settings in PyTorch. Thus, while large values of λ will indeed hinder norm growth, we find preliminary empirical evidence that standard choices (∼0.01) do not prevent it.
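For intuition, here is a back-of-envelope sketch (ours, not the paper's experiment; the learning rate and step count are illustrative assumptions): AdamW's decoupled weight decay multiplies each parameter by (1 − ηλ) at every step, independently of the gradient update, so over t steps decay alone contracts norms by (1 − ηλ)^t. The λ regimes in the experiment differ substantially in how much contraction the gradient updates must overcome.

```python
# Decoupled weight decay (AdamW) shrinks parameters by (1 - lr * wd) per
# step. lr and steps below are illustrative assumptions, not the paper's
# exact training configuration.
lr = 1e-3
steps = 10_000
shrinks = {wd: (1.0 - lr * wd) ** steps for wd in (0.001, 0.01, 0.1)}

# Decay alone leaves ~99% of the norm at wd = 0.001 but only ~37% at
# wd = 0.1, so larger wd requires much larger updates for norms to grow.
assert shrinks[0.001] > 0.98
assert shrinks[0.1] < 0.4
assert shrinks[0.1] < shrinks[0.01] < shrinks[0.001]
```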