Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah A. Smith


Abstract
The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (2 norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such “saturated” networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
Anthology ID:
2021.emnlp-main.133
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1766–1781
Language:
URL:
https://aclanthology.org/2021.emnlp-main.133
DOI:
10.18653/v1/2021.emnlp-main.133
Bibkey:
Cite (ACL):
William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, and Noah A. Smith. 2021. Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1766–1781, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent (Merrill et al., EMNLP 2021)
Copy Citation:
PDF:
https://aclanthology.org/2021.emnlp-main.133.pdf
Code
 viking-sudo-rm/norm-growth
Data
Penn TreebankWikiText-2