How to Train BERT with an Academic Budget

While large language models à la BERT are used ubiquitously in NLP, pretraining them is considered a luxury that only a few well-funded industry labs can afford. How can one train such models with a more modest budget? We present a recipe for pretraining a masked language model in 24 hours using a single low-end deep learning server. We demonstrate that through a combination of software optimizations, design choices, and hyperparameter tuning, it is possible to produce models that are competitive with BERT-base on GLUE tasks at a fraction of the original pretraining cost.


Introduction
Large language models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and GPT-3 (Brown et al., 2020), have become the de facto models used in many NLP tasks. However, their pretraining phase can be prohibitively expensive for startups and academic research groups, limiting the research and development of model pretraining to only a few well-funded industry labs. How can one train a large language model with commonly available hardware in a reasonable amount of time?
We present a recipe for training a BERT-like masked language model (MLM) in 24 hours in a limited computation environment. Our approach combines multiple elements from recent work: faster implementation (Rasley et al., 2020), faster convergence through over-parameterization (Li et al., 2020b), best practices for scaling language models (Kaplan et al., 2020), single-sequence training (Joshi et al., 2020; Liu et al., 2019), and more. Moreover, we conduct an extensive hyperparameter search tailored to our resource budget, and find that synchronizing the learning rate warmup and decay schedules with our 24-hour budget greatly improves model performance.
When evaluating on GLUE (Wang et al., 2018), our recipe produces models that are competitive with BERT-base, a model that was trained on 16 TPUs for 4 days. This recipe can also be applied to other corpora, as we demonstrate by training a French-language model on par with CamemBERT-base (Martin et al., 2020) on the XNLI French benchmark (Conneau et al., 2018). Overall, our findings demonstrate that, with the right recipe and an understanding of the available computational resources, large language models can indeed be trained in an academic setting.

Problem Setup
We investigate the task of pretraining a large language model under computational constraints. To simulate an academic computation budget, we limit the training time to 24 hours and the hardware to a single low-end deep learning server. Using current cloud-compute prices, we estimate the dollar cost of each training run at around $50 to $100.
Under these constraints, our goal is to pretrain a model that performs well on classification tasks, such as those in GLUE (Wang et al., 2018). We therefore follow standard practice and focus on BERT-style transformer encoders trained on the MLM objective (Devlin et al., 2019). We retain the standard pretraining corpus of English Wikipedia and the Toronto BookCorpus (Zhu et al., 2015), containing 16GB of text, tokenized into subwords using BERT's uncased tokenizer.

Combining Efficient Training Methods
To speed up our training process, we combine a variety of recent techniques for optimizing a masked language model. To the best of our knowledge, this is the first time that such techniques are combined and evaluated as a unified framework for training large models with limited computational resources.

Methods
Data Since our focus is mainly on sentence classification tasks, we limit sequences to 128 tokens for the entire pretraining process. Devlin et al. (2019) also apply this practice to 90% of the training steps, and extend the sequence to 512 tokens only for the last 10%. Training on short sequences increases sample efficiency by reducing padding, and also allows us to fit a larger model into memory (see Model). In addition, we use single-sequence training without the next sentence prediction (NSP) objective, which was shown to benefit optimization (Joshi et al., 2020; Liu et al., 2019). To maximize time spent on training, we hold out only 0.5% of the data and compute the validation-set loss every 30 minutes.
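The masking procedure itself follows BERT's standard recipe (select 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and keep 10% unchanged). A minimal sketch, assuming integer token ids and BERT's uncased-vocabulary constants; the function name is illustrative:

```python
import random

MASK_ID = 103       # [MASK] id in BERT's uncased vocabulary
VOCAB_SIZE = 30522
IGNORE = -100       # positions that do not contribute to the MLM loss

def mask_tokens(token_ids, rng, mask_prob=0.15):
    """BERT-style masking: of the selected 15% of positions, replace 80%
    with [MASK], 10% with a random token, and keep 10% unchanged."""
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tok)  # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token unchanged
        else:
            labels.append(IGNORE)
    return inputs, labels
```

Under single-sequence training, each example is simply one contiguous span of up to 128 tokens, with no NSP pairing or second-segment sampling.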
Model Recent work has found that larger models tend to achieve better performance than smaller models when trained for the same wall-clock time (Li et al., 2020b; Kaplan et al., 2020). We adopt these recommendations and train a BERT-large model: 24 layers, 1,024 hidden dimensions, 16 attention heads, 4,096 dimensions in the feed-forward layer, with pre-layer normalization (Shoeybi et al., 2019). The purpose of applying the "train large" approach is not to compete with fully-trained extra-large models, but to train the best model we can, regardless of size, given the computational constraints (Section 2).
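As a rough sanity check on this configuration's size, the parameter count can be estimated from the dimensions above. A back-of-the-envelope sketch, assuming BERT's 30,522-token vocabulary and 512 learned positions (exact totals vary slightly across implementations):

```python
def transformer_params(layers=24, d=1024, ffn=4096, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-large-style encoder. Biases and
    layer norms are included; pre- vs. post-LN does not change the count."""
    emb = vocab * d + max_pos * d + 2 * d    # token + position + segment embeddings
    attn = 4 * (d * d + d)                   # Q, K, V, and output projections
    ffn_block = d * ffn + ffn + ffn * d + d  # two linear layers with biases
    lns = 2 * 2 * d                          # two layer norms (gain + bias)
    return emb + layers * (attn + ffn_block + lns)
```

This lands at roughly 334M parameters, consistent with the usual ~340M figure quoted for BERT-large.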
Software We base our implementation on the DeepSpeed software package (Rasley et al., 2020), which includes optimizations for training language models, such as data parallelization and mixed-precision training. We further improve the implementation by replacing the MLM prediction head with sparse token prediction (Liu et al., 2019), and use fused implementations for all linear-activation-bias operations and layer norms, in particular the APEX LayerNorm operation.
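Sparse token prediction avoids projecting every position onto the full vocabulary: since only about 15% of positions are masked, the expensive (d × vocab) projection can be computed for those positions alone. An illustrative NumPy sketch (the real implementation operates on batched GPU tensors):

```python
import numpy as np

def mlm_logits_sparse(hidden, masked_positions, W_vocab):
    """Project only the masked positions onto the vocabulary, instead of
    all seq_len positions. hidden: (seq_len, d); W_vocab: (d, vocab)."""
    gathered = hidden[masked_positions]   # (n_masked, d)
    return gathered @ W_vocab             # (n_masked, vocab)
```

With 128-token sequences and 15% masking, only ~19 positions per sequence need the 30k-way projection, rather than all 128.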

Combined Speedup
We compare our optimized framework to the official implementation of Devlin et al. (2019). Table 1 shows that using the official code to train BERT-base could take almost 6 days under our hardware assumptions (Section 2), and a large model might require close to a month of non-stop computation. In contrast, our recipe significantly speeds up training, allowing one to train BERT-large with the original number of steps (1M) in a third of the time (8 days), or to converge in 2-3 days by enlarging the batch size. While larger batch sizes do not guarantee convergence to models of equal quality, they are generally recommended (Ott et al., 2018; Liu et al., 2019), and present a more realistic starting point for our next phase (hyperparameter tuning) given our 24-hour constraint.
We also conduct an ablation study of the engineering improvements in our model. Table 2 shows that the efficient implementation recovers an additional 1.75 hours (out of 24) for training operations, time that would otherwise have been wasted.

Hyperparameter Search
Calibrating hyperparameters is key to increasing model performance in deep learning and NLP (Levy et al., 2015; Liu et al., 2019). We re-tune core optimization hyperparameters to fit our low-resource setting, rather than the massive-computation settings for which they are currently tuned. Our hyperparameter search yields substantial improvements in MLM loss after 24 hours of training.

Hyperparameters
Batch Size (bsz) The number of examples (sequences of up to 128 tokens) in each mini-batch. We try batch sizes of 4k, 8k, and 16k examples, which are of a similar order of magnitude to the ones used by Liu et al. (2019). Since our hardware has limited memory, we achieve these batch sizes via gradient accumulation. In terms of parameter updates, these batch sizes amount to approximately 23k, 12k, and 6k update steps in 24 hours, respectively.

Peak Learning Rate (lr) The maximum value of our linear learning rate schedule, which starts at 0, warms up to the peak learning rate, and then decays back to 0. We try 5e-4, 1e-3, and 2e-3.
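Gradient accumulation, used above to reach 4k-16k-example batches on limited memory, simply averages gradients over several micro-batches before taking a single optimizer step. A toy sketch on a scalar least-squares model (the model and data are illustrative, not part of our recipe):

```python
def grad(w, batch):
    """Gradient of the mean squared error 0.5*(w*x - y)^2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Average micro-batch gradients to emulate one large-batch update.
    Exact only when all micro-batches have the same size."""
    g = 0.0
    for mb in micro_batches:
        g += grad(w, mb)
    return g / len(micro_batches)
```

The optimizer then applies a single update with the accumulated gradient, so a batch of 4k examples can be processed as, e.g., 128 micro-batches of 32.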
Warmup Proportion (wu) We determine the number of warmup steps as a proportion of the total number of steps. Specifically, we try 0%, 2%, 4%, and 6%, which all reflect significantly fewer warmup steps than in BERT.
Total Days (days) The number of days it would take the learning rate scheduler to decay back to 0, as measured on our hardware. This is equivalent to setting the maximal number of steps. Together with the warmup proportion, it determines where along the learning rate schedule the training process stops. For a value of 1 day, the learning process will end when the learning rate decays back to 0. We try setting the schedule according to 1, 3, and 9 days.
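Taken together, lr, wu, and days define a single linear warmup-decay schedule; with days = 1, the decay reaches zero exactly as the 24-hour budget expires. A sketch, where total_steps would be derived from the days setting and the measured throughput of the hardware:

```python
def linear_schedule(step, total_steps, warmup_prop, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0.
    With days=1, total_steps equals the steps achievable in 24 hours,
    so training ends exactly when the learning rate reaches 0."""
    warmup_steps = int(warmup_prop * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

With days = 3 or 9, total_steps exceeds what fits in the budget, so training stops partway down the decay and never reaches a learning rate of 0.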

Methodology
We train our model on the MLM objective under each hyperparameter configuration. Although there are 108 combinations in total, poor configurations are easy to identify early on. After 3 hours, we prune configurations that have not reached a validation-set loss of 6.0 or less; this rule removes diverging runs, such as configurations with 0% warmup. After 12 hours, we keep only the top 50% of models with respect to the validation-set loss, and resume their runs until they reach 24 hours.
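The pruning schedule above amounts to two simple filters over intermediate validation losses. A sketch with toy loss values (the 6.0 threshold and top-50% rule are from the text; configuration names are illustrative):

```python
def prune_at_3h(runs, threshold=6.0):
    """Drop configurations whose validation loss has not reached 6.0."""
    return {cfg: loss for cfg, loss in runs.items() if loss <= threshold}

def prune_at_12h(runs):
    """Keep the better half of the surviving configurations."""
    ranked = sorted(runs, key=runs.get)  # ascending loss: best first
    keep = ranked[: max(1, len(ranked) // 2)]
    return {cfg: runs[cfg] for cfg in keep}
```

The surviving runs then continue training, untouched, to the full 24 hours.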

Results
We first analyze the effect of each hyperparameter by plotting the distribution of the validation-set loss per value (Figure 1). We observe a clear preference towards synchronizing the learning rate schedule with the actual amount of training time in the budget (1 day), corroborating the results of Li et al. (2020a). We also find the smaller batch size to have an advantage over larger ones, along with moderate-to-high learning rates. We suspect that the smaller batch size works better for our resource budget due to the trade-off between the number of samples and the number of updates, for which a batch size of 4,096 seems to be a better fit. Finally, there appears to be a preference towards longer warmup proportions; however, a closer look at those cases reveals that when the number of total days is larger (3 or 9), it is better to use a smaller warmup proportion (2%), since otherwise the warmup phase takes up a larger portion of the actual training time.

Table 3 shows the best configurations by MLM loss. Our calibrated models perform substantially better than models with BERT's default hyperparameters (which were tuned for 4 days on 16 TPUs), and there is relatively little variance in performance among the top models. We select the best model (Search #1) and name it 24hBERT. Figure 2 compares 24hBERT with models using the default calibration, and shows that 24hBERT converges significantly faster.

Downstream Evaluation
We test the performance of our optimized, calibrated 24hBERT model on the GLUE benchmark (Wang et al., 2018); see Appendix D for a full description of the tasks. For finetuning, we follow the practice of Liu et al. (2019) and run a grid search over multiple hyperparameters and seeds (see Appendix B), and also use mid-training (Phang et al., 2018) on MNLI for RTE, MRPC, and STS-B.

Table 4 shows the results on GLUE's test sets. Our 24hBERT model performs on par with BERT-base on 3 major tasks (MNLI, QNLI, SST-2) and even outperforms it on CoLA. However, 24hBERT reaches slightly lower results on 4 tasks (QQP, RTE, MRPC, STS-B). Overall, this amounts to a small difference in the average score (0.4%), showing that our recipe can indeed produce a model that is largely competitive with BERT-base, but at a small fraction of its training cost.

Generalizing to New Corpora
Our recipe was calibrated using a particular corpus (English Wikipedia and books), but does it generalize to other corpora as well? We follow CamemBERT (Martin et al., 2020) and train a masked language model on French Wikipedia, using exactly the same dataset. We then finetune our French 24hBERT on the XNLI French dataset (Conneau et al., 2018), reaching 78.5% accuracy, compared to 79.1% for CamemBERT-base. This result demonstrates that our recipe can indeed be ported to other corpora as-is, without retuning hyperparameters.

Discussion
Comparison with ELECTRA While Clark et al. (2020) show impressive pretraining speedups with ELECTRA, we argue that having a generative model (MLM or LM) is important nowadays, given the recent rise of few-shot learning and prompting approaches (Schick and Schütze, 2021). To emphasize this point, we run 24hBERT on the SST-2 (Socher et al., 2013) task both with and without prompts in the few-shot setting. Figure 3 shows that there is a significant advantage in the ability to prompt the model, which is perhaps not trivial for non-generative ELECTRA-style models.

FLOPs as a Measure of Efficiency While measuring floating point operations is commonly used to compare efficiency in a hardware-agnostic manner, it is not an accurate tool for comparing the actual time (and therefore budget) associated with training a model. Specifically, measuring FLOPs ignores the fact that many operations run in parallel (e.g. via batching), and are thus much less costly in practice (Li et al., 2020b).
Limitations Our investigation is limited to classification tasks. Because we train on short sequences, our model cannot handle reading comprehension tasks (without resorting to sliding windows), and is therefore not fully comparable to BERT-base on such tasks. However, it might be possible to continue training the model for a few more hours with sequences longer than 128 tokens, as done by Devlin et al. (2019). We leave such experiments for future work.

Conclusions
We present a recipe for pretraining a masked language model in 24 hours using a low-end deep learning server. We show that by combining multiple efficient training methods presented in recent work and carefully calibrating the hyperparameters, it is possible to pretrain a model that is competitive with BERT-base on GLUE tasks. In contrast to other works in this area, which often focus on a single method for improving efficiency, our recipe consists of many different components that together amount to very large speedups:
• Short sequences (Devlin et al., 2019)
• Single-sequence training (Joshi et al., 2020)
• Training larger models (Li et al., 2020b)
• DeepSpeed (Rasley et al., 2020)
• Sparse token prediction (Liu et al., 2019)
• Fused implementations
• Avoiding disk I/O
• Large batch sizes (Liu et al., 2019)
• Large learning rates (Liu et al., 2019)
• Short warmup
• Synchronizing schedule with time budget (Li et al., 2020a)
As with every recipe, our recommendations may need to be adapted to the hardware and time constraints at hand. We hope that our findings allow additional players to participate in language model research and development, and help democratize the art of pretraining.

A Hyperparameter Configurations

Table 5 presents the full set of hyperparameter configurations we examine in Section 4.

B Finetuning Hyperparameters
Finetuning hyperparameters used for the GLUE benchmark tasks are presented in Table 7. We run each configuration using 5 random seeds and select the median result of the best configuration.

C Hardware Comparison

Table 6 includes a time comparison of our 24-hour training setup when using more recent hardware backends.

D Downstream Tasks
MNLI: Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task. Given a pair of sentences, we wish to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

QQP: Quora Question Pairs is a binary classification task, where the goal is to determine whether two questions asked on Quora are semantically equivalent or not (Iyer et al., 2016).
QNLI: Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) that has been converted into a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs that contain the answer, and the negative examples are pairs from the same paragraph that do not contain the answer.
SST-2: The Stanford Sentiment Treebank is a binary single-sentence classification task, consisting of sentences extracted from movie reviews. Their sentiment is based on human annotations (Socher et al., 2013).
CoLA: The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically "acceptable" or not (Warstadt et al., 2019).
STS-B: The Semantic Textual Similarity Benchmark is a collection of sentence pairs, drawn primarily from news headlines, with additional sources as well (Cer et al., 2017). They were annotated with a score from 1 to 5, denoting how similar the two sentences are in semantic meaning.

MRPC: Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources. The human annotations indicate whether the sentences in each pair are semantically equivalent (Dolan and Brockett, 2005).