Dynamic Masking Rate Schedules for MLM Pretraining

Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, Matthew Leavitt


Abstract
Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model’s fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.
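For illustration, below is a minimal Python sketch of a linearly decaying masking-rate schedule for MLM pretraining. The specific start rate (30%), end rate (15%), and the simplified per-token masking routine are assumptions made for this example; the abstract only states that the rate decreases linearly over pretraining and that 15% is the standard fixed rate, so the paper's exact settings may differ.

import random


def masking_rate(step, total_steps, start_rate=0.30, end_rate=0.15):
    # Linearly interpolate the masking rate from start_rate down to end_rate
    # as training progresses (start_rate and end_rate are illustrative values).
    progress = min(step / total_steps, 1.0)
    return start_rate + (end_rate - start_rate) * progress


def mask_tokens(token_ids, rate, mask_id):
    # Mask each token independently with probability `rate`.
    # Simplification: every selected token becomes [MASK]; BERT's 80/10/10
    # mask/random/keep split is omitted here for brevity.
    labels = [-100] * len(token_ids)  # -100 = position ignored by the MLM loss
    masked = list(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < rate:
            labels[i] = tok      # predict the original token here
            masked[i] = mask_id  # replace the input token with [MASK]
    return masked, labels


# Usage sketch: recompute the rate at every step and mask the batch with it.
# rate = masking_rate(step, total_steps)
# inputs, labels = mask_tokens(batch_token_ids, rate, tokenizer.mask_token_id)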
Anthology ID: 2024.eacl-short.42
Volume: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: March
Year: 2024
Address: St. Julian’s, Malta
Editors: Yvette Graham, Matthew Purver
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 477–487
URL: https://aclanthology.org/2024.eacl-short.42
Cite (ACL): Zachary Ankner, Naomi Saphra, Davis Blalock, Jonathan Frankle, and Matthew Leavitt. 2024. Dynamic Masking Rate Schedules for MLM Pretraining. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 477–487, St. Julian’s, Malta. Association for Computational Linguistics.
Cite (Informal): Dynamic Masking Rate Schedules for MLM Pretraining (Ankner et al., EACL 2024)
PDF: https://aclanthology.org/2024.eacl-short.42.pdf