BranchNorm: Robustly Scaling Extremely Deep Transformers

Yijin Liu, Xianfeng Zeng, Fandong Meng, Jie Zhou


Abstract
Recently, DeepNorm scaled Transformers to extreme depths (i.e., 1000 layers) and revealed the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm attempts to constrain the model update to a constant value. Although such a constraint can benefit the early stage of model training, it may leave the model undertrained over the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer in accordance with the training period. BranchNorm not only theoretically stabilizes training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
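The idea of rescaling the non-residual branch according to the training period can be illustrated with a minimal sketch. The abstract does not state the actual schedule, so everything specific here is an assumption made for illustration: the linear ramp, the names `branch_scale`, `residual_update`, and `warmup_steps`, and the omission of layer normalization. It should not be read as the authors' formulation.

```python
def branch_scale(step: int, warmup_steps: int) -> float:
    """Hypothetical schedule: ramp linearly from 0 to 1, then hold at 1.

    The linear shape and `warmup_steps` are assumptions, not taken from the paper.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def residual_update(x, branch_out, step, warmup_steps):
    """Add a rescaled non-residual branch output to the residual stream x.

    x and branch_out are equal-length lists of floats. Layer normalization
    is left out to keep the sketch small.
    """
    alpha = branch_scale(step, warmup_steps)
    return [xi + alpha * bi for xi, bi in zip(x, branch_out)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    branch_out = [4.0, 4.0]
    for step in (0, 50, 200):
        print(step, residual_update(x, branch_out, step, warmup_steps=100))
```

Under this assumed schedule the branch contributes little early in training, which keeps updates small, and contributes fully later, so the model is not held to a constant constraint for the whole run.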
Anthology ID:
2024.findings-acl.695
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11675–11687
URL:
https://aclanthology.org/2024.findings-acl.695
Cite (ACL):
Yijin Liu, Xianfeng Zeng, Fandong Meng, and Jie Zhou. 2024. BranchNorm: Robustly Scaling Extremely Deep Transformers. In Findings of the Association for Computational Linguistics ACL 2024, pages 11675–11687, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
BranchNorm: Robustly Scaling Extremely Deep Transformers (Liu et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.695.pdf