On the Role of Bidirectionality in Language Model Pre-Training

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.


Introduction
NLP has undergone a paradigm shift driven by pre-trained models like GPT and BERT (Bommasani et al., 2021). These models are trained on unlabeled corpora in a self-supervised fashion, and can be effectively adapted to downstream tasks either through conventional fine-tuning (Devlin et al., 2019) or few-shot priming (Brown et al., 2020).
Despite their widespread use, there is no universal formula to pre-train language models: prior work has explored different architectures and learning objectives, often focusing on different applications. For instance, BERT (Devlin et al., 2019) pre-trained masked language models for NLU fine-tuning, BART (Lewis et al., 2020) pre-trained seq2seq models on denoising for both NLU and generation tasks, and GPT-3 (Brown et al., 2020) scaled autoregressive language models focusing on zero- and few-shot priming. However, such models differ on many factors in addition to their architecture and learning objective (e.g., the pre-training data, compute and hyperparameters), making a principled comparison difficult. Motivated by that, Raffel et al. (2020) presented a comprehensive study exploring various pre-training objective and architecture variants in a controlled environment. However, they conducted most of the exploration using small models, while recent work has found that different approaches behave differently at scale (Tay et al., 2022a,b), and their evaluation was limited to fine-tuning.
In this paper, we focus on a key factor that differentiates many pre-training approaches, bidirectionality, and study it in different settings as a function of scale. We propose a new framework that distinguishes between two notions of bidirectionality: bidirectional context (whether the prediction of a given token is conditioned on both the right and the left context, or only on either of them), and bidirectional attention (whether there are blocks of tokens that can all attend to each other, contrasting with triangular attention masking). Our framework offers knobs to control each of them separately, generalizing several previous approaches (e.g., BERT leverages both types of bidirectionality, GPT does not use any, prefix LMs only leverage bidirectional attention, and CM3 only leverages bidirectional context).
We train a total of 24 models covering 6 variants of our framework and 5 model sizes with up to 6.7B parameters, and evaluate them on 4 settings: language modeling, text infilling, zero-shot priming, and fine-tuning. We find that bidirectional attention and context have a different impact depending on the use case, and there is not a single configuration that is optimal for all scenarios. Moreover, we find this behavior to remain consistent at the scale range considered in this study. With recent scaling work focusing on fully unidirectional models, this suggests that there is potential for alternative architectures and learning objectives that might be better suited for other use cases.

Table 1 :
Variants of the proposed framework explored in this work. n denotes the document length; B(n, p) denotes the binomial distribution; U(a, b) denotes the discrete uniform distribution. † We set n bidir = 0 and n mask = 0 with probability p = 0.1, so that the model gets more exposure to regular language modeling.

Proposed framework
As illustrated in Figure 1, we propose a generalized framework to pre-train transformer models on unlabeled corpora. Our framework supports both unidirectional and bidirectional attention, as well as next token prediction and single-token infilling, using the following parameters to balance them:

• n bidir controls the length of the prefix using bidirectional attention, whereas the rest of the document uses unidirectional attention. More concretely, we set the attention mask so that the ith token can attend to the jth token if and only if j ≤ max(i, n bidir ).

• n mask controls how many tokens are masked. Masked tokens are moved to the end along with their positional embeddings.

• n predict controls the length of the suffix for which we define our supervisory signal. We use the cross-entropy loss to train the model, predicting the masked tokens for the last n mask , and the next token for the remaining n predict − n mask .

As such, our framework allows us to vary the two notions of bidirectionality discussed above: n bidir controls the weight of bidirectional attention, whereas n mask and n predict control the weight of bidirectional context. In addition, larger values of n predict result in more tokens of supervision. Table 1 summarizes the specific variants of this general framework that we explore in our experiments, along with a descriptive name that we will use to refer to each of them. Some variants are equivalent or closely related to existing approaches. In particular, NXTUNI is equivalent to conventional autoregressive language models, and NXTPRE is equivalent to prefix language models. MSKBI is closely related to the RoBERTa objective, except that we do not replace 10% of the masked tokens with the original or a randomly picked one. HYBUNI is similar to the CM3 objective, except that we mask individual tokens instead of spans and we draw the number of masks from a binomial distribution. Finally, we introduce MSKUNI as a variant of MSKBI using unidirectional attention (or, from another perspective, a variant of HYBUNI predicting masked tokens alone), and HYBPRE as a variant of HYBUNI using a bidirectional attention prefix.
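Concretely, the attention rule and the mask-and-move transform above can be sketched as follows (a minimal illustration rather than the actual fairseq implementation; the function names and the list-based token representation are our own):

```python
import random

def build_attention_mask(n, n_bidir):
    # mask[i][j] is True iff token i+1 may attend to token j+1, following the
    # 1-based rule j <= max(i, n_bidir): the first n_bidir tokens form a fully
    # bidirectional block, and the remaining tokens attend left-to-right.
    return [[j + 1 <= max(i + 1, n_bidir) for j in range(n)] for i in range(n)]

def mask_and_move(tokens, n_mask, mask_id, rng=random):
    # Pick n_mask positions at random, remove those tokens from their original
    # positions, and append <mask> placeholders at the end, carrying the
    # original positional indices so the positional embeddings move with them.
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = set(positions)
    inputs = [t for i, t in enumerate(tokens) if i not in masked] + [mask_id] * n_mask
    pos = [i for i in range(len(tokens)) if i not in masked] + positions
    targets = [tokens[i] for i in positions]  # labels for the trailing masks
    return inputs, pos, targets
```

With n bidir = 0 the mask reduces to standard causal (lower-triangular) attention, and with n bidir = n it is fully bidirectional.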
Experimental settings
Our implementation is based on fairseq (Ott et al., 2019). We apply the procedure described in §2 to each document separately, and combine multiple documents into a single sequence to speed up training. As such, we move the masked tokens to the end of each document (as opposed to the end of the whole sequence), and apply a bidirectional attention prefix to each document rather than the sequence as a whole.
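The per-document treatment described above can be sketched as follows (an illustrative helper of our own, assuming a transform like the mask-and-move procedure of §2 that returns inputs, positions, and targets for one document):

```python
def pack_documents(docs, transform):
    # Apply the masking / move-to-end transform to each document separately,
    # then concatenate the results into one training sequence: masked tokens
    # stay at the end of their own document, and any bidirectional attention
    # prefix is applied per document rather than to the packed sequence.
    seq, positions, targets = [], [], []
    for doc in docs:
        inp, pos, tgt = transform(doc)
        seq.extend(inp)
        positions.extend(pos)
        targets.append(tgt)
    return seq, positions, targets
```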

Evaluation
We evaluate our models in the following settings:

Language modeling. We evaluate the ability of our models to predict the next token in a sequence as measured by perplexity. Different from training, we do not concatenate different documents into the same sequence, and instead score each document as a separate sequence. Given that NXTPRE and HYBPRE are primarily trained to predict the last part of a document conditioned on the first part, we also measure the perplexity at predicting the last 20% of the tokens in each document conditioned on the first 80%. So as to understand whether using bidirectional attention in the prefix is useful to that end, we try different values of n bidir according to a ratio r bidir , so that n bidir = r bidir × n prefix , where n prefix = 0.8n is the length of the prefix we are conditioning on.
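As a sketch, the suffix evaluation setup and the perplexity computation look as follows (the helper names are our own, and the per-token log-probabilities are assumed to come from the model):

```python
import math

def suffix_eval_setup(n, r_bidir):
    # Condition on the first 80% of the document and score the last 20%,
    # using bidirectional attention over the first n_bidir prefix tokens.
    n_prefix = int(0.8 * n)
    n_bidir = int(r_bidir * n_prefix)
    return n_prefix, n_bidir

def perplexity(token_logprobs):
    # Perplexity over the scored (suffix) tokens, given natural-log
    # probabilities for each token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```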
Single token infilling. We mask a single word in each document at random, and measure the accuracy at predicting it. To that end, we use the same procedure used for training (illustrated in Figure 1), which moves the mask token to the end of the sequence. This approach is not suitable for models trained exclusively on next token prediction like NXTUNI and NXTPRE, as their predictions cannot be conditioned on the right context. However, one can still use such models for infilling in a generative fashion, replacing the masked token with each element in the vocabulary, scoring the resulting sequences autoregressively, and predicting the token yielding the highest scoring sequence. In addition to our primary evaluation, we compare both of these approaches, which we refer to as infill (direct infilling) and full (full sequence scoring). Given that full can be prohibitively expensive when considering the full vocabulary, we constrain the set of options to the top 32 candidates generated by the 125M MSKBI model (the top 32 candidates contain the correct one in 95.19% of the cases, which is the upper bound accuracy in this setting).

Zero-shot priming. We evaluate our models on zero-shot priming using the exact same settings and tasks as Artetxe et al. (2021), which comprise ReCoRD (Zhang et al., 2018), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2020), StoryCloze (Mostafazadeh et al., 2016) and OpenBookQA (Mihaylov et al., 2018). These are all multiple choice tasks, so we score the populated prompt corresponding to each option in an autoregressive fashion and predict the highest scoring one; refer to Artetxe et al. (2021) for a description of the scoring function used for each task and the evaluation protocol. However, when the options differ in a single token, as is common for classification tasks with single-token verbalizers, one can also score such token directly in an infilling fashion. So as to understand how both approaches compare, we further evaluate our models on MNLI (Williams et al., 2018), using a single-token verbalizer placed in the middle of the prompt.

Fine-tuning. We experiment with the following tasks from GLUE (Wang et al., 2019): CoLA (Warstadt et al., 2019), MNLI-m (Williams et al., 2018), MRPC (Dolan and Brockett, 2005), QNLI (Rajpurkar et al., 2016), RTE (Dagan et al., 2006; Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and SST-2 (Socher et al., 2013). Our fine-tuning approach closely follows BERT and similar models: we place a special </s> token at the end of the sequence (analogous to the special <CLS> token used by BERT) and learn a new classification head on top. We ran a grid search with the learning rate in {1e-05, 2e-05, 5e-05, 5e-06} and batch size in {16, 32, 64}, and report the best development accuracy for each model. The rest of the hyperparameters follow RoBERTa. For all variants, we tried fine-tuning both with fully unidirectional attention (r bidir = 0) and fully bidirectional attention (r bidir = 1). Refer to Appendix B for more details.

Table 4 :
Suffix perplexity. We measure perplexity at predicting the last 20% of the tokens in each document conditioned on the first 80%, using n bidir = r bidir × n prefix for inference, where n prefix = 0.8n is the length of the prefix we are conditioning on.
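The generative full sequence scoring strategy described above can be sketched as follows; score_fn stands in for the model's sequence log-probability and is a hypothetical interface of our own, not part of the paper's code:

```python
def full_sequence_infill(score_fn, tokens, mask_pos, candidates):
    # Generative infilling with a left-to-right LM: substitute each candidate
    # at the masked position, score the full sequence autoregressively, and
    # keep the candidate yielding the highest total log-probability.
    best, best_score = None, float("-inf")
    for cand in candidates:
        filled = tokens[:mask_pos] + [cand] + tokens[mask_pos + 1:]
        score = score_fn(filled)
        if score > best_score:
            best, best_score = cand, score
    return best
```

Direct infilling (infill) instead reads the prediction at the trailing mask position in a single forward pass, which is why full is only practical when the candidate set is small (here, the top 32 candidates from the 125M MSKBI model).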

Results
We visualize our main results in Figure 2, and discuss each setting in more detail next.

Language modeling
We report full document perplexities in Table 3. NXTUNI obtains the best results, followed by HYBUNI and HYBPRE, with NXTPRE doing slightly better than HYBUNI at small scale. This is consistent with how close the pre-training objective is to the end task: NXTUNI is exclusively trained on next token prediction, HYBUNI combines it with masking (which is not used here), and HYBPRE further combines it with a bidirectional attention prefix (which is not used here either). However, it is interesting that scaling up does not reduce the gap between them. This suggests that there is some fundamental interference between these different capabilities, and increasing capacity does not mitigate it. There are various factors that could explain this: both masking and the bidirectional attention prefix reduce the supervision on next token prediction, and masking further introduces some noise in the original sequence. Moreover, training to use both unidirectional and bidirectional attention and/or context might provide a conflicting signal, although our results later in §4.2 suggest that this does not have a major impact at scale.

Table 4 reports suffix perplexity results, where we predict the last 20% of the tokens in each document conditioned on the rest. Compared to the previous results, NXTPRE and HYBPRE reduce the gap with NXTUNI and HYBUNI, but they still lag behind them. In both cases, we find that the models benefit from using bidirectional attention in the prefix at inference time (i.e., higher values of r bidir yield lower perplexity), but the improvement is relatively small. It is intriguing that NXTUNI outperforms NXTPRE, when the latter was trained on suffix prediction and can leverage bidirectional attention. We attribute this to the bidirectional prefix reducing the number of tokens of supervision during training.

Single token infilling
We report infilling results in Table 5. MSKBI obtains the best results, which can be explained by its use of bidirectional attention and the fact that it is exclusively trained on masking. Our results suggest that both of these factors play a role, but their impact varies at scale. As for the first factor, we find that bidirectional attention has a larger impact on infilling compared to next token prediction (§4.1), as reflected by MSKBI doing substantially better than MSKUNI. Moreover, we find that this also holds at scale, as reflected by HYBPRE doing better with larger values of r bidir , while outperforming HYBUNI. Regarding the second factor, we find that combining masking with next token prediction significantly hurts infilling performance for small models, as reflected by the large gap between MSKUNI and HYBUNI. However, we also find the impact of this to vanish at scale, as reflected by the gap between MSKBI and HYBPRE with r bidir = 1.0 becoming smaller for larger models. This also explains why HYBPRE with r bidir = 0.0 outperforms HYBUNI for small models, but the trend is reversed as we scale up: the bidirectional prefix in HYBPRE reduces the relative weight of next token prediction during training, which outweighs the discrepancy with not using bidirectional attention at inference time for small models, but not for larger ones. Interestingly, this is different from the behavior observed for language modeling in §4.1, where scale did not significantly mitigate the negative impact of combining masking and next token prediction during training. We attribute this to masking introducing noise in the original document, as well as reducing the number of tokens that we train on next token prediction.
Table 6 reports infilling results re-ranking the top 32 candidates from the 125M MSKBI model. The best results are still obtained by MSKBI, but we find the generative approach described in §3.2 to be competitive, with NXTUNI obtaining the second best results at 125M and the third best results for larger models. This suggests that models trained exclusively on next token prediction can also be used for infilling as long as the set of candidates is small, even outperforming hybrid models like HYBUNI that are trained both on next token prediction and infilling itself. Note that the reverse is not true: the addition of next token prediction in HYBUNI does not reduce the amount of supervision on infilling with respect to MSKUNI, as we use the same value of n mask in both cases. In fact, it is remarkable that NXTUNI is only outperformed by models using bidirectional attention which, consistent with our previous results, seems strongly beneficial for infilling. Nevertheless, we also find direct infilling (infill) to scale better than generative full sequence scoring (full) for both HYBUNI and HYBPRE, although this could (partly) be explained by the interference between next token prediction and masking diminishing at scale as discussed previously.

Zero-shot priming
We report zero-shot priming results in Table 7. We observe the same general trends as in language modeling (§4.1), with NXTUNI performing best, followed by HYBUNI and HYBPRE. The results are generally consistent across tasks.

On MNLI, consistent with the intrinsic evaluation in §4.2, we find full sequence scoring with NXTUNI to be competitive with direct infilling with MSKBI. In fact, full sequence scoring does even better comparatively, obtaining the best results in all but one of the model sizes. Moreover, it is remarkable that both HYBUNI and HYBPRE obtain better results with full sequence scoring compared to direct infilling in all cases. Consistent with our previous results, this suggests that left-to-right language models can be a valid or even superior alternative to masked language models for single-token infilling tasks, as long as one can afford scoring each candidate separately.

Fine-tuning
We report average fine-tuning results comparing unidirectional and bidirectional attention in Table 9, and full results for the optimal setting for each variant in Table 10.
Our results show that bidirectional attention is helpful for fine-tuning regardless of scale, with fully bidirectional models (MSKBI) performing the best, followed by models pre-trained with a bidirectional attention prefix (HYBPRE, NXTPRE), and fully unidirectional models performing the worst (HYBUNI, NXTUNI, MSKUNI). Interestingly, changing the attention type at fine-tuning time (using unidirectional attention for pre-training and bidirectional attention for fine-tuning, or the other way around) works poorly.
At the same time, we find that the role of bidirectional context is dependent on the type of attention used. When using fully unidirectional attention, bidirectional context has no clear impact, with NXTUNI and HYBUNI performing similarly. In contrast, when using bidirectional attention, bidirectional context seems beneficial, with HYBPRE performing better than NXTPRE at small scale. This suggests that pre-training with bidirectional context is important for the model to learn to make effective use of bidirectional attention.

Related work
While it was once common to use random initialization for supervised learning, a series of works showed substantial improvements from pre-training autoregressive models on next token prediction (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018). The pre-train/fine-tune paradigm was further popularized by BERT (Devlin et al., 2019) and its derivatives like RoBERTa (Liu et al., 2019), which obtained further gains from pre-training bidirectional encoders on masked language modeling. Subsequent work explored masking spans instead of individual tokens, using either bidirectional encoder-only models (Joshi et al., 2020) or encoder-decoder models (Lewis et al., 2020; Raffel et al., 2020). More recently, there has been renewed interest in scaling left-to-right autoregressive language models with a focus on few-shot priming (Radford et al., 2019; Brown et al., 2020; Rae et al., 2021; Hoffmann et al., 2022; Smith et al., 2022; Chowdhery et al., 2022; Zhang et al., 2022).
While unidirectional and bidirectional models have largely been developed as separate strands of work serving different purposes, there have also been some attempts to combine the best of both worlds. XLNet (Yang et al., 2019) pre-trained autoregressive models over all permutations of the factorization order, enabling the model to use bidirectional context with strong results on fine-tuning. Similarly, CM3 (Aghajanyan et al., 2022) trained left-to-right autoregressive models, masking some spans that are predicted at the end of the sequence. ERNIE 3.0 (Sun et al., 2021) proposed a modular architecture, combining a shared unidirectional module with either another unidirectional module for NLG or a bidirectional module for NLU. Finally, Raffel et al. (2020) and Wu et al. (2021) explored splitting documents in two halves and predicting the second one conditioned on the first one, using bidirectional attention over the first half and unidirectional attention over the second.
Finally, concurrent work (2022) conducts a comprehensive study with a focus on zero-shot learning and multi-task fine-tuning. In contrast, we focus on the specific role of bidirectionality, and compare models of different sizes.

Conclusions
In this work, we study the role of bidirectionality in language model pre-training through a new framework that generalizes previous approaches. Our main findings are as follows:

• Bidirectional attention is strongly beneficial for infilling and fine-tuning. In contrast, prefix language models lag behind regular language models on next token prediction, even if they get a small benefit from leveraging bidirectional attention in the prefix. This behavior is consistent at scale.

• Models trained jointly to use unidirectional and bidirectional context, like HYBUNI, lag behind regular language models on next token prediction, and scale does not mitigate this. Such models also lag behind pure masked language models on infilling, but scale does help close this gap as long as they are trained with a bidirectional attention prefix. For fine-tuning, bidirectional context is beneficial when used in conjunction with bidirectional attention, but not when used with unidirectional attention.

• While direct infilling requires bidirectional context and benefits from bidirectional attention as discussed above, models using unidirectional context and attention are also competitive in infilling when one can separately score each candidate. For settings where the set of candidates is small (e.g., zero-shot priming for classification), regular language models obtain comparable or even superior results to models pre-trained on infilling.
All in all, our results show that there is not a single configuration that is optimal for all use cases, and this remains generally consistent within the scale range explored in this work. While prior work on scaling has focused on left-to-right autoregressive models, this suggests that there might be other objectives and architectures that are better suited for other applications like fine-tuning. Given the cost of pre-training several models, we would like to explore modular (Sun et al., 2021) or adaptation (Wang et al., 2022) approaches in the future, where one would either have a single model with modular components specialized for different use cases, or efficiently adapt an existing model by changing the parameters in our framework instead of training several models from scratch.

Limitations
Our study focuses on the role of bidirectionality in language model pre-training, and does not explore other factors that might affect model performance.
In particular, we mask individual tokens without considering longer spans, and do not explore the impact of the masking rate. In addition, we do not consider sequence-to-sequence models in our study, which combine bidirectional attention in the encoder and unidirectional attention in the decoder. Finally, we train all variants for the same number of tokens, making them comparable in terms of training cost, but resulting in models using a bidirectional attention prefix or a masking objective seeing fewer tokens of supervision.

Figure 1 :
Figure 1: Proposed framework. Starting from the original document, we mask n mask tokens at random and move them, along with their positional embeddings, to the end. We define our loss over the last n predict tokens, predicting the masked token for the last n mask , and the next token for the remaining n predict − n mask . We use bidirectional attention over the first n bidir tokens, and unidirectional attention over the rest. Refer to Appendix A for a more detailed description.

Figure 2 :
Figure 2: Main results. Unidir and Bidir denote using n bidir = 0 and n bidir = n after pre-training, respectively (or n bidir = n prefix for suffix perplexity).

Figure 3 :
Figure 3: Proposed framework. 1) We start with the original sequence in the input, and predict the next token in the output; 2) We choose n mask tokens at random, replace them with the special <mask> token in the input, and predict the masked token (rather than the next token) in the output; 3) We move the masked tokens and their corresponding positional embeddings to the end; 4) We only predict the last n predict tokens, using bidirectional attention for the first n bidir tokens and unidirectional attention for the rest (final objective).

Table 5 :
Single token infilling accuracy. We mask a random token in each validation document and measure the accuracy at predicting it, using n bidir = r bidir × n for inference.

Table 6 :
Single token infilling accuracy, re-ranking the top 32 candidates from the 125M MSKBI model. † denotes n bidir = n; the rest use n bidir = 0. Refer to §3.2 for more details.

Table 8 :
MNLI results, comparing full sequence scoring and direct infilling.

Table 9 :
Average fine-tuning accuracy.

Table 10 :
Fine-tuning accuracy.We use n bidir = 0 for NXTUNI, MSKUNI and HYBUNI, and n bidir = n for the rest.