NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it is increasingly expensive as models and pretraining corpora grow larger. We propose NarrowBERT, a modified transformer encoder that increases the throughput of masked language model pretraining by more than 2×. NarrowBERT sparsifies the transformer model so that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining, rather than on all of the tokens as in the usual transformer encoder. We also show that NarrowBERT increases throughput at inference time by as much as 3.5× with minimal (or no) performance degradation on sentence encoding tasks such as MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification and CoNLL NER tasks and show that it remains comparable to standard BERT performance.


Introduction
Pretrained masked language models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and DeBERTa (He et al., 2021), have pushed the state of the art in a wide range of downstream tasks in natural language processing. At their core is the transformer architecture (Vaswani et al., 2017), which consists of interleaved self-attention and feedforward sublayers. Since the self-attention sublayer has quadratic time complexity in the input sequence length (Vaswani et al., 2017), many have proposed methods to make the self-attention computation more efficient (Katharopoulos et al., 2020; Choromanski et al., 2021; Wang et al., 2020; Peng et al., 2021, 2022).
In this work, we explore an orthogonal approach to efficiency: can we make masked language models efficient by reducing the length of the input sequence that each layer needs to process? In particular, pretraining by masked language modeling only involves predicting the masked tokens (typically only 15% of the input tokens; Devlin et al., 2019; Liu et al., 2019). Despite this sparse pretraining objective, each transformer layer computes a representation for every token. Beyond pretraining, many downstream applications use only a single vector representation (i.e., only the [CLS] token) for prediction, which is much smaller than the number of input tokens (e.g., sequence classification tasks as in GLUE/SuperGLUE; Wang et al., 2018, 2019). By narrowing the input sequence to the transformer layers, we can accelerate both pretraining and inference.
We present NarrowBERT, a new architecture that takes advantage of the sparsity in the training objective. We present two NarrowBERT methods in the sections that follow (Figure 1), and we provide the code to reproduce our experiments at redacted-during-review. The first method reduces the input length to the feedforward sublayers by reordering the interleaved self-attention and feedforward sublayers of the standard transformer architecture (Press et al., 2020): after two standard, interleaved transformer layers, the remaining self-attention sublayers are applied consecutively, followed only by feedforward sublayers. This way, the later feedforward computations are performed only for masked tokens, resulting in a 1.3× speedup in pretraining (§3). The second approach reduces the input length to the attention sublayers: in the attention mechanism (Bahdanau et al., 2015), queries are computed only for masked tokens, while the keys and values are not re-computed for non-masked tokens, which leads to a greater than 2× speedup in pretraining.
We extensively evaluate our efficient pretrained models on well-established downstream tasks (e.g., Wang et al., 2018; Tjong Kim Sang and De Meulder, 2003). We find that our modifications result in almost no drop in downstream performance, while providing substantial pretraining and inference speedups (§3). While efficient attention variants are a promising research direction, this work presents a different and simple approach to making transformers efficient, with minimal changes to the architecture.

NarrowBERT
In Figures 1b and 1c, we illustrate two variations of NarrowBERT. We define some notation to describe the configuration of our models. s refers to a single self-attention layer and f refers to a single feedforward layer. The colon : refers to the 'narrowing' operation, which gathers the masked positions from the output of the previous layer.
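For concreteness, the narrowing operation can be implemented as a gather along the sequence dimension. The following PyTorch sketch is only illustrative; the function and tensor names are our own and are not taken from a released implementation.

import torch

def narrow(hidden_states: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Gather the hidden states at the given (e.g., masked) positions.

    hidden_states: (batch, seq_len, hidden_dim), output of the previous layer.
    positions:     (batch, num_kept), indices of the positions to keep.
    Returns:       (batch, num_kept, hidden_dim).
    """
    index = positions.unsqueeze(-1).expand(-1, -1, hidden_states.size(-1))
    return torch.gather(hidden_states, dim=1, index=index)

During pretraining, positions holds the indices of the masked tokens; for classification finetuning it can hold only the [CLS] position, as discussed below.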
The first variation ('ContextFirst' in Fig. 1b) uses attention to contextualize the sentence all at once at the beginning of the model. In short, the transformer layers have been rearranged to frontload the attention components. The example given in the figure specifies the model as sf{5,s}:{5,f}, which means that the input sentence is encoded by a self-attention layer, a feedforward layer, and 5 consecutive self-attention layers. At that point, the masked positions of the encoded sentence are gathered into a tensor and passed through 5 feedforward layers, thereby avoiding further computation for all non-masked tokens. Finally, the masked tokens are predicted from these representations and the MLM loss is computed.
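A minimal sketch of the ContextFirst ordering for the sf{5,s}:{5,f} example is given below, reusing the hypothetical narrow helper above. Residual connections, layer norms, and dropout inside each sublayer are omitted for brevity, so this is a schematic rather than a faithful implementation.

import torch
import torch.nn as nn

class ContextFirstEncoder(nn.Module):
    # Schematic sf{5,s}:{5,f} stack: attention is frontloaded, and the
    # feedforward sublayers after ':' operate on the masked positions only.
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ffn: int = 3072, n: int = 5):
        super().__init__()
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(1 + n)])
        self.ffn_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
             for _ in range(1 + n)])

    def forward(self, x: torch.Tensor, masked_positions: torch.Tensor) -> torch.Tensor:
        x, _ = self.attn_layers[0](x, x, x)   # s: first standard attention sublayer
        x = self.ffn_layers[0](x)             # f: first standard feedforward sublayer
        for attn in self.attn_layers[1:]:     # {5,s}: attention over the full sequence
            x, _ = attn(x, x, x)
        x = narrow(x, masked_positions)       # ':' keep only the masked positions
        for ffn in self.ffn_layers[1:]:       # {5,f}: feedforward on masked positions only
            x = ffn(x)
        return x                              # (batch, num_masked, d_model), fed to the MLM head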
The second variation ('SparseQueries' in Fig. 1c) does not reorder the layers at all. Instead, the sf:{5,sf} model contextualizes the input sentence in a more limited way. As shown in Figure 2, the input sentence is first contextualized by an s layer and an f layer, but the non-masked tokens are never contextualized again afterwards. Only the masked tokens are contextualized by the remaining {5,sf} layers.
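The narrowed attention in SparseQueries can be sketched as cross-attention in which only the masked positions issue queries, while the keys and values are projected from the full sequence as produced by the initial sf block, which later layers read but do not update. Again, the names here are our own, and normalization and residuals are omitted.

import torch
import torch.nn as nn

class SparseQueryAttention(nn.Module):
    # One narrowed attention sublayer: queries come from the masked positions only;
    # keys and values come from the full sequence encoded by the initial sf block.
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, masked_states: torch.Tensor, full_states: torch.Tensor) -> torch.Tensor:
        # masked_states: (batch, num_masked, d_model) -- queries
        # full_states:   (batch, seq_len, d_model)    -- keys and values
        out, _ = self.attn(masked_states, full_states, full_states)
        return out  # (batch, num_masked, d_model)

Each remaining {5,sf} block applies this narrowed attention followed by a feedforward sublayer on the masked positions only.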
Since the masked tokens are only about 15% of the total sentence length, the potential speedup is ~6.6× for every feedforward or attention layer downstream of a narrowing : operation. The memory usage can also decrease by ~6.6× for those layers since the sequence length has decreased, which allows us to use larger batch sizes during training.
For GLUE, Amazon, and IMDB text classification tasks, only the [CLS] token is used for prediction. When we finetune or predict with ContextFirst on a GLUE/Amazon/IMDB task, the feedforward layers only need to operate on the [CLS] token. When we finetune or predict with SparseQueries, only the [CLS] token needs to be contextualized by the layers after the narrowing operation, so the speedups carry over to finetuning and inference.
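As an illustration of the previous paragraph, the only change at finetuning or inference time for [CLS]-based classification is the index passed to the hypothetical narrow helper above: position 0 instead of the masked positions.

import torch

def narrow_to_cls(hidden_states: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden_dim)
    # The narrowed layers then process a single token per example.
    batch_size = hidden_states.size(0)
    cls_position = torch.zeros(batch_size, 1, dtype=torch.long, device=hidden_states.device)
    return narrow(hidden_states, cls_position)  # (batch, 1, hidden_dim)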

Experiments
In our experiments, we use 15% masking for masked language model (MLM) training. Following Liu et al. (2019), we do not use next-sentence prediction as a pretraining task. We use large batch sizes and high learning rates to fully utilize GPU memory, as suggested by Izsak et al. (2021): batches are sized to be the largest that fit in GPU memory, and we use a learning rate of 0.0005. Models are trained for 70k steps, where each step contains 1728 sequences of 512 tokens, and gradient accumulation is used to accumulate the minibatches needed per step. Models were trained on hosts with 8 Nvidia A100 GPUs. We use the Hugging Face implementations of the baseline BERT and Funnel Transformer models, and we pretrain the baseline BERT, Funnel Transformer, and NarrowBERT models on the same Wikipedia and Books corpora for the same total number of steps.
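For reference, the pretraining setup described above can be summarized as follows. The per-GPU batch size and gradient-accumulation split shown here are an assumed, illustrative factorization of the 1728 sequences per step, not the exact values used.

pretraining_config = {
    "masking_rate": 0.15,              # standard MLM masking; no next-sentence prediction
    "learning_rate": 5e-4,
    "total_steps": 70_000,
    "sequences_per_step": 1728,
    "max_seq_length": 512,
    "num_gpus": 8,                     # Nvidia A100s
    "per_gpu_batch_size": 24,          # assumed: 24 * 9 * 8 = 1728 sequences per step
    "gradient_accumulation_steps": 9,  # assumed (any factorization reaching 1728 works)
}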
In Figure 3, we see the evolution of the development MLM loss over the course of model training. The BERT and NarrowBERT models all converge to similar values, with the NarrowBERT models reaching a slightly higher MLM loss near the end of training.
In Table 1, we present the results of our extrinsic evaluation on various GLUE tasks. The reduction in performance is small or non-existent, and on WNLI, the NarrowBERT variations perform better than the baseline. For SparseQueries, it is clear that using more layers prior to the narrowing operation improves performance, though the training and inference speedups become smaller. We note that the Funnel Transformer implementation in PyTorch is slower than the baseline BERT model; this may be because the original implementation was written in TensorFlow and optimized for Google TPUs.
In Table 2, we provide results on the IMDB and Amazon reviews classification tasks and the CoNLL NER task. Generally, NarrowBERT is close to the baseline in performance, and the SparseQueries performance improves as more layers are used before the narrowing operation.
It is well known that the variability in the performance of BERT on certain GLUE tasks is extreme (Mosbach et al., 2020; Dodge et al., 2020), where the differences in performance between finetuning runs can exceed 20% (absolute). We have also observed this extreme variability in the course of our own GLUE finetuning experiments. While many techniques have been proposed to address this issue, it is not the goal of this work to apply finetuning stabilization methods to maximize BERT's performance. For this reason, we have excluded the RTE, MRPC, and CoLA tasks (which are high-variance tasks studied in the aforementioned papers) from our evaluation.

Discussion and Conclusion
We have explored two straightforward ways of exploiting the sparsity in the masked language model loss: rearranging the layers of the transformer encoder so that the feedforward components avoid computations on the non-masked positions, and sparsifying the queries in the attention mechanism so that only the masked positions are contextualized. The NarrowBERT variants can speed up training by a factor of ~2× and inference by a factor of 3×, while maintaining very similar performance on GLUE, IMDB, Amazon, and CoNLL NER tasks. Based on the favorable trade-off between speed and performance seen in Section 3, we recommend that practitioners consider using the SparseQueries NarrowBERT model with 2 or 3 layers before narrowing.

Limitations
Due to budget constraints, we only performed pretraining and downstream experiments with base-sized transformer models. We also only applied the masked language modeling objective, but there are other effective pretraining objectives (e.g., Clark et al., 2020). Nonetheless, since we introduce minimal changes to the architecture, we hope that subsequent work will benefit from our narrowing operations and conduct a wider range of pretraining and downstream experiments. While pretrained models can be applied to even more downstream tasks, we designed a reasonable task suite in this work, consisting of both GLUE sentence classification and the CoNLL NER sequence labeling tasks.