How to Train BERT with an Academic Budget

Peter Izsak, Moshe Berchansky, Omer Levy


Abstract
While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.
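The abstract centers on masked language model (MLM) pretraining. For readers unfamiliar with the objective, the sketch below shows BERT-style dynamic token masking in plain PyTorch; it is only an illustration of the MLM objective, not the paper's training recipe or the code in the linked repository, and all names (mask_tokens, mlm_probability, etc.) are illustrative.

```python
# Illustrative sketch of BERT-style masked language modeling (not the paper's recipe).
# Assumes PyTorch; function and argument names here are made up for this example.
import torch

def mask_tokens(input_ids: torch.Tensor,
                mask_token_id: int,
                vocab_size: int,
                mlm_probability: float = 0.15):
    """Return (masked_inputs, labels) for one MLM step.

    Of the selected positions: 80% become [MASK], 10% a random token,
    10% stay unchanged. Unselected positions get label -100 so the
    cross-entropy loss ignores them.
    """
    labels = input_ids.clone()

    # Choose which positions to predict.
    masked_indices = torch.bernoulli(
        torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked_indices] = -100  # ignored by the loss

    inputs = input_ids.clone()

    # 80% of selected positions -> [MASK]
    replace_mask = (torch.bernoulli(torch.full(labels.shape, 0.8)).bool()
                    & masked_indices)
    inputs[replace_mask] = mask_token_id

    # 10% -> random token (half of the remaining 20%)
    random_mask = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                   & masked_indices & ~replace_mask)
    random_tokens = torch.randint(vocab_size, labels.shape, dtype=torch.long)
    inputs[random_mask] = random_tokens[random_mask]

    # The remaining 10% of selected positions keep their original token.
    return inputs, labels
```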
Anthology ID:
2021.emnlp-main.831
Volume:
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2021
Address:
Online and Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10644–10652
URL:
https://aclanthology.org/2021.emnlp-main.831
DOI:
10.18653/v1/2021.emnlp-main.831
Cite (ACL):
Peter Izsak, Moshe Berchansky, and Omer Levy. 2021. How to Train BERT with an Academic Budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10644–10652, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
How to Train BERT with an Academic Budget (Izsak et al., EMNLP 2021)
PDF:
https://aclanthology.org/2021.emnlp-main.831.pdf
Video:
https://aclanthology.org/2021.emnlp-main.831.mp4
Code:
peteriz/academic-budget-bert (+ additional community code)
Data:
CoLA, GLUE, MRPC, MultiNLI, QNLI, Quora Question Pairs, RTE, SST, SST-2, STS Benchmark