@inproceedings{liu-etal-2020-understanding,
title = "Understanding the Difficulty of Training Transformers",
author = "Liu, Liyuan and
Liu, Xiaodong and
Gao, Jianfeng and
Chen, Weizhu and
Han, Jiawei",
editor = "Webber, Bonnie and
Cohn, Trevor and
He, Yulan and
Liu, Yang",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.emnlp-main.463",
doi = "10.18653/v1/2020.emnlp-main.463",
pages = "5747--5763",
abstract = "Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand {\_}{\_}what complicates Transformer training{\_}{\_} from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially{---}for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize the early stage{'}s training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="liu-etal-2020-understanding">
<titleInfo>
<title>Understanding the Difficulty of Training Transformers</title>
</titleInfo>
<name type="personal">
<namePart type="given">Liyuan</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiaodong</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jianfeng</namePart>
<namePart type="family">Gao</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Weizhu</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiawei</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2020-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Bonnie</namePart>
<namePart type="family">Webber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Trevor</namePart>
<namePart type="family">Cohn</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yulan</namePart>
<namePart type="family">He</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yang</namePart>
<namePart type="family">Liu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Online</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand what complicates Transformer training from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially—for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize the early stage’s training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance.</abstract>
<identifier type="citekey">liu-etal-2020-understanding</identifier>
<identifier type="doi">10.18653/v1/2020.emnlp-main.463</identifier>
<location>
<url>https://aclanthology.org/2020.emnlp-main.463</url>
</location>
<part>
<date>2020-11</date>
<extent unit="page">
<start>5747</start>
<end>5763</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Understanding the Difficulty of Training Transformers
%A Liu, Liyuan
%A Liu, Xiaodong
%A Gao, Jianfeng
%A Chen, Weizhu
%A Han, Jiawei
%Y Webber, Bonnie
%Y Cohn, Trevor
%Y He, Yulan
%Y Liu, Yang
%S Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
%D 2020
%8 November
%I Association for Computational Linguistics
%C Online
%F liu-etal-2020-understanding
%X Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand what complicates Transformer training from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially—for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize the early stage’s training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance.
%R 10.18653/v1/2020.emnlp-main.463
%U https://aclanthology.org/2020.emnlp-main.463
%U https://doi.org/10.18653/v1/2020.emnlp-main.463
%P 5747-5763
Markdown (Informal)
[Understanding the Difficulty of Training Transformers](https://aclanthology.org/2020.emnlp-main.463) (Liu et al., EMNLP 2020)
ACL
Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the Difficulty of Training Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5747–5763, Online. Association for Computational Linguistics.
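
For readers who want a concrete picture of the mechanism the abstract describes, the sketch below shows one illustrative way to temper a layer's dependency on its residual branch: a learnable per-dimension scale on the shortcut of a post-LayerNorm block. The module name, the constant initialization, and the overall structure are assumptions made for illustration only; the paper's adaptive initialization chooses that scale from the model's own output statistics, which this constant-initialized sketch does not attempt, and this snippet is not the authors' implementation.

```python
import torch
import torch.nn as nn


class ScaledResidual(nn.Module):
    """Post-LN block whose shortcut is rescaled by a learnable vector omega.

    Illustrative sketch only: the paper initializes such a scale adaptively
    from the output statistics of preceding layers; a plain constant is used here.
    """

    def __init__(self, d_model: int, sublayer: nn.Module, init_scale: float = 1.0):
        super().__init__()
        self.sublayer = sublayer              # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)
        # omega controls how much the block output depends on the residual
        # branch f(x) relative to the shortcut x: larger omega, lighter dependency.
        self.omega = nn.Parameter(torch.full((d_model,), float(init_scale)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x_out = LayerNorm(omega * x + f(x))
        return self.norm(self.omega * x + self.sublayer(x))


if __name__ == "__main__":
    block = ScaledResidual(d_model=512, sublayer=nn.Linear(512, 512), init_scale=2.0)
    y = block(torch.randn(8, 16, 512))        # (batch, seq_len, d_model)
    print(y.shape)                            # torch.Size([8, 16, 512])
```

Initializing the shortcut scale above one reduces the relative contribution of the sublayer output at the start of training, which is the direction the abstract argues stabilizes the early stage; because the scale is a trainable parameter, the dependency can grow back later in training.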