%0 Conference Proceedings
%T Optimizing Deeper Transformers on Small Datasets
%A Xu, Peng
%A Kumar, Dhruv
%A Yang, Wei
%A Zi, Wenjie
%A Tang, Keyi
%A Huang, Chenyang
%A Cheung, Jackie Chi Kit
%A Prince, Simon J.D.
%A Cao, Yanshuai
%Y Zong, Chengqing
%Y Xia, Fei
%Y Li, Wenjie
%Y Navigli, Roberto
%S Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
%D 2021
%8 August
%I Association for Computational Linguistics
%C Online
%F xu-etal-2021-optimizing
%X It is a common belief that training deep transformers from scratch requires large datasets. Consequently, for small datasets, people usually use shallow and simple additional layers on top of pre-trained models during fine-tuning. This work shows that this does not always need to be the case: with proper initialization and optimization, the benefits of very deep transformers can carry over to challenging tasks with small datasets, including Text-to-SQL semantic parsing and logical reading comprehension. In particular, we successfully train 48 layers of transformers, comprising 24 fine-tuned layers from pre-trained RoBERTa and 24 relation-aware layers trained from scratch. With fewer training steps and no task-specific pre-training, we obtain state-of-the-art performance on the challenging cross-domain Text-to-SQL parsing benchmark Spider. We achieve this by deriving a novel Data-dependent Transformer Fixed-update initialization scheme (DT-Fixup), inspired by the prior T-Fixup work. Further error analysis shows that increasing depth can help improve generalization on small datasets for hard cases that require reasoning and structural understanding.
%R 10.18653/v1/2021.acl-long.163
%U https://aclanthology.org/2021.acl-long.163
%U https://doi.org/10.18653/v1/2021.acl-long.163
%P 2089-2102