DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks

Since 2017, Transformer-based models have played critical roles in various downstream Natural Language Processing tasks. However, a common limitation of the attention mechanism used in the Transformer Encoder is that it cannot automatically capture word-order information, so explicit position embeddings are generally required as input to the target model. In contrast, the Transformer Decoder with causal attention masks is naturally sensitive to word order. In this work, we focus on improving the position encoding ability of BERT with causal attention masks. Furthermore, we propose a new pre-trained language model, DecBERT, and evaluate it on the GLUE benchmark. Experimental results show that (1) the causal attention mask is effective for BERT on language understanding tasks; (2) our DecBERT model without position embeddings achieves comparable performance on the GLUE benchmark; and (3) our modification accelerates the pre-training process, and DecBERT w/ PE achieves better overall performance than the baseline systems when pre-training with the same amount of computational resources.


Introduction
In recent years, the Transformer model proposed by Vaswani et al. (2017) has supplanted the widely-used LSTM (Hochreiter and Schmidhuber, 1997) as an indispensable component of many NLP systems. There are two main branches of model variants: the Transformer Encoder and the Transformer Decoder. Encoder-based language models, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and DeBERTa (He et al., 2020), have achieved great success on many natural language understanding benchmarks (e.g., GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a)). Decoder-based language models such as the GPT family (Radford and Narasimhan, 2018; Radford et al., 2019; Brown et al., 2020) have shown superior performance on natural language generation. All of them utilize the Multi-Head Self-Attention (MHA) mechanism (Vaswani et al., 2017). Since MHA is designed as an order-invariant mechanism, a Transformer Encoder without the help of position embeddings essentially behaves like a bag-of-words model. On the other hand, in the Transformer Decoder, the causal attention masks make the MHA different from that of the Transformer Encoder. Specifically, Tsai et al. (2019) have proved that MHA with such attention masks is not permutation equivariant, indicating that the Transformer Decoder is sensitive to word order.
Several studies focus on enriching the position information of BERT to improve natural language understanding performance (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021), e.g., by introducing extra learnable parameters to trace the word order. Previous analyses also indicate that the lower layers of BERT tend to capture rich surface-level structural information such as position information (Jawahar et al., 2019). In this paper, to improve the language understanding of BERT, we propose to enrich the position information in the lower hidden layers instead of introducing extra learnable positional parameters.
To this end, we first design analysis experiments to examine the effectiveness of causal attention masks in terms of capturing position information. We then propose a new pre-trained language model, DecBERT, which adds causal attention masks to the lower layers of BERT (e.g., the first two layers) to enhance its position encoding ability. In this way, our proposed model is naturally sensitive to word order. We pre-train DecBERT as a masked language model, following the same objective as BERT. To verify whether our modification helps BERT trace word order, we also compare against a variant of DecBERT that excludes any position embeddings. The experimental results show that DecBERT w/o PE achieves a validation PPL about 77 times lower than BERT w/o PE (4.59 vs. 353.97) and performs comparably to BERT w/ PE on downstream tasks, corroborating the effectiveness of our modification. Furthermore, DecBERT w/ PE achieves better performance than BERT on most downstream tasks when pre-training with the same amount of time and computational resources. By analyzing the pre-training process, we find that our modification can also accelerate pre-training.
The contributions of this work are summarised as follows:
• We propose a novel pre-trained language model, DecBERT, which utilizes causal attention masks to enhance the language understanding of BERT.
• We show that DecBERT w/o PE has comparable performance with BERT w/ PE, indicating that the causal attention masks are effective for modeling word order.
• When pre-training with the same amount of time and computational resources, DecBERT w/ PE achieves lower validation PPL and better overall performance on GLUE than BERT.

Background: Transformer
The Transformer is a neural network model proposed by Vaswani et al. (2017) that relies on the multi-head self-attention (MHA) mechanism.
Input Layer. Due to the order-invariance of MHA, a token embedding is added to a position embedding to form the input of the Transformer Encoder or Decoder:

h_i = TE(x_i) + PE(i)

where x_i is the token at the i-th position, TE is a token embedding matrix and PE is a position embedding matrix. In the paper of Vaswani et al. (2017), a fixed sinusoidal PE is used:

PE(i, 2j) = sin(i / 10000^(2j / d_m)),    PE(i, 2j+1) = cos(i / 10000^(2j / d_m))

where j is the dimension and d_m is the model size.
In later work, Devlin et al. (2019) choose to use a learnable PE matrix.
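For concreteness, the following is a minimal PyTorch sketch of the fixed sinusoidal position embeddings described above (the function name and tensor layout are our own, assuming an even d_model):

import torch

def sinusoidal_position_embeddings(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of fixed sinusoidal position embeddings."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)               # (max_len, 1)
    # 10000^(2j / d_model) for each even dimension index 2j
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions: cosine
    return pe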
Multi-head Self-attention (MHA). MHA takes a sequence of vectors h = [h_1, h_2, ..., h_n] as input. They are transformed into three different vectors, query (Q), key (K) and value (V), by three linear transformations and passed to the multi-head self-attention. The computation of a single head is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of a single head. MHA repeats the same process for h heads. The outputs of all heads are concatenated together and passed through a linear projection W^O:

MHA(h) = Concat(head_1, ..., head_h) W^O

Transformer Encoder and Decoder. An Encoder layer consists of multi-head attention followed by a feed-forward network (FFN). The outputs of MHA and FFN are passed through a LayerNorm (Ba et al., 2016) with residual connections (He et al., 2016). Multiple such layers are stacked to form a Transformer Encoder. The difference between the Decoder and the Encoder is that the Decoder uses causal attention masks to mask the attention values of subsequent tokens, so that the Decoder can only decode a token relying on the tokens in the past.1
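The sketch below (ours, not the authors' implementation) shows single-head scaled dot-product attention with an optional causal mask, illustrating how masking restricts each position to attend only to itself and earlier positions:

import math
import torch
import torch.nn.functional as F

def single_head_attention(q, k, v, causal: bool = False):
    """q, k, v: (batch, seq_len, d_k). Returns (batch, seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, seq_len, seq_len)
    if causal:
        seq_len = scores.size(-1)
        # True above the diagonal = future positions that must not be attended to.
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)                        # rows sum to 1 over allowed positions
    return weights @ v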

Methodology
In this section, we first analyze the relationship between Transformer Decoder and position embeddings (section 3.1). Based on this analysis, we inject the causal attention masks into BERT to create our new pre-trained language models, DecBERT (section 3.2).

Transformer Decoder and Position Embeddings
Previous studies (Tsai et al., 2019) indicate that the Transformer Decoder with causal attention masks is sensitive to word order. We wonder whether the Transformer Decoder can perform well without position embeddings. We assume that if the Transformer Decoder without any position embeddings still retains comparable performance with its counterpart with position embeddings, this will corroborate that the causal attention masks help the Transformer encode word order. To this end, we design a straightforward causal language modeling experiment on English and Chinese data, as follows.
Basic Model. Our basic model is an 8-layer Transformer Decoder with an embedding size of 768, a feed-forward hidden size of 3072, 12 attention heads and the GELU activation function (Hendrycks and Gimpel, 2020). It is a smaller version of GPT with 95M trainable parameters for the English model and 77.5M for the Chinese model.2 We find that with a standard 12-layer GPT, the number of trainable parameters would exceed the number of tokens in the WikiText-103 dataset, which risks over-fitting, so we choose an 8-layer model.
Data and Training. We use two publicly available Wikipedia datasets. The first one is the English WikiText-103 (Merity et al., 2017). We train and evaluate our language models on the standard splits of WikiText-103, which contain 1.8M sentences for training and 3.76k sentences for evaluation. The second one is the Chinese Wikipedia, which contains about 9.28M sentences; we randomly select 34k sentences for evaluation and 9.25M for training. We use Fairseq to pre-process all the data into binary files. All English data is tokenized with the SentencePiece tokenizer (Kudo and Richardson, 2018), the same as RoBERTa. All Chinese data is tokenized by character. All models are trained with Fairseq. The training objective is the Causal Language Modeling objective. We use a batch size of 128 and train for 100k steps, optimized by Adam (Kingma and Ba, 2015). We also use polynomial learning rate decay with 10k warmup steps. All models use the same hyper-parameters; we list the details in the Appendix. We use two NVIDIA A100 40GB GPUs to train each model. For WikiText-103, training takes about 10 hours per model; for the Chinese Wikipedia, about 8.5 hours per model. Table 1 presents the perplexity (PPL) scores of Transformer Decoders with or without position embeddings on the WikiText-103 and Chinese Wikipedia validation sets. The Transformer Decoder w/o PE achieves comparable performance with its counterpart with learnable PE, with PPL only about 0.2 higher. This result reveals that the additional performance gain brought by position embeddings is small: relying only on its causal attention masks, the Transformer Decoder can still perform well. Combining our experiment with previous studies (Tsai et al., 2019; Irie et al., 2019), we conclude that the causal attention masks make the Transformer sensitive to word order.
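As an illustration of how a validation PPL of this kind can be computed from causal language modeling loss, here is a hedged sketch (the helper name and batch format are our assumptions, not the authors' code):

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_perplexity(model, batches):
    """Exponentiated average per-token cross-entropy over the validation set."""
    total_loss, total_tokens = 0.0, 0
    for input_ids, target_ids in batches:        # targets are the inputs shifted left by one
        logits = model(input_ids)                # (batch, seq_len, vocab_size)
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target_ids.view(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += target_ids.numel()
    return math.exp(total_loss / total_tokens)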

Our DecBERT Model
In section 3.1, we conclude that a Transformer with causal attention masks is naturally sensitive to word order. Since position information is indispensable for BERT, we propose to enhance the existing BERT model with causal attention masks.
In this paper, we add causal attention masks to all or some hidden layers of BERT. In this way, the specific layers with such masks are sensitive to word order by design, which enhances the position encoding ability of BERT. Such a framework can further result in better language understanding performance; for example, in pre-trained language modeling, causal attention masks were added to all 12 layers of GPT (Radford and Narasimhan, 2018). However, comparing with BERT (Devlin et al., 2019), we observe that GPT lags behind BERT on almost all downstream tasks.3 This is because the self-attention mechanism with such masks only considers a one-sided information flow; it cannot process the input sentence comprehensively and has a high risk of losing language information. Therefore, we conjecture that it is not effective to use causal attention masks in all hidden layers: there is a strong need to maintain a balance between the gain in position encoding ability and the loss of language information.
To determine which layer(s) should receive causal attention masks, we refer to the BERTology work of Jawahar et al. (2019), which conducts comprehensive experiments to analyze and interpret the information captured by each layer of BERT. Their results indicate that the lower layers of BERT capture rich language structural information. Position information is a common type of structural information, so we propose to add causal attention masks to the lower layers (e.g., the first two layers4) to improve the position encoding ability of BERT. We denote our model as DecBERT. There are two versions of our model, DecBERT-Same and DecBERT-Diff; both are 12-layer base-size models.
• DecBERT-Same: This model has a similar structure to BERT (see Figure 1(a)), but we use causal attention masks to convert the first two Encoder layers into two Decoder layers with the same direction (from left to right). The 12-layer model thus has 10 Encoder layers and 2 Decoder layers, as shown in Figure 1(b). In this way, the first two layers are naturally sensitive to word order;
• DecBERT-Diff: This model is designed to enhance DecBERT-Same by gaining more language information from different encoding directions. It has the same structure as DecBERT-Same, except that the second Decoder layer has the opposite direction (from right to left). Figure 1(c) illustrates the model structure. A sketch of the per-layer masking appears after this list.
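The following is a minimal sketch of how the per-layer attention masks for the two variants could be constructed (function and variable names are ours and not taken from any released code):

import torch

def decbert_layer_masks(seq_len: int, variant: str = "diff", num_layers: int = 12):
    """Return a list of (seq_len, seq_len) boolean masks, one per layer.
    True means the query position may attend to the key position."""
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)   # standard bidirectional attention
    left_to_right = torch.tril(full)                        # causal: current and past tokens only
    right_to_left = torch.triu(full)                        # reversed causal: current and future tokens only
    masks = [full] * num_layers                             # layers 3-12 stay bidirectional
    masks[0] = left_to_right
    masks[1] = left_to_right if variant == "same" else right_to_left
    return masks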
One might think that DecBERT is similar to a Transformer with an RNN layer (Neishi and Yoshinaga, 2019). However, DecBERT is quite different: it has a similar structure to BERT and requires the same amount of computation, which is much faster than a Transformer with an RNN.

Experimental Setup
Our experiments are separated into two parts, a small-scale pre-training scenario and a large-scale pre-training scenario. Since the small-scale pre-training consumes much less time and fewer computational resources, we intend to answer several research questions in this part:
• Can DecBERT without any position embeddings still understand language well?
• Can DecBERT with position embeddings outperform BERT?
• Is using causal attention masks with different directions more helpful than using the same direction?
• Why does DecBERT benefit from the causal attention masks, and how do such masks affect the pre-training process?
For the large-scale pre-training scenario, we intend to examine whether the performance gap between our DecBERT and BERT diminishes after scaling up the pre-training data size and time. Such a setting presents a more comprehensive view of whether our modification benefits pre-trained language models.
For a fair comparison, we re-implement BERT and pre-train it with the same settings as DecBERT in the small-scale and large-scale pre-training. We denote it as BERT-reImp.
Small-scale Pre-training Scenario. The pre-training data is the widely-used English Wikipedia Corpus. We randomly select 158.4M sentences for training and 50k sentences for validation. The pre-training objective is the Masked Language Modeling objective. We use a batch size of 256 and pre-train for 200k steps, optimized by Adam. All models use the same hyper-parameters; we list the details in the Appendix. We use four NVIDIA A100 40GB GPUs to pre-train each model, costing about 34.5 hours per model.
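For reference, here is a simplified sketch of BERT-style masking for the Masked Language Modeling objective (it omits special-token handling and is not the authors' exact implementation):

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Simplified BERT-style masking: 15% of positions are selected as targets;
    of those, 80% become [MASK], 10% become a random token, 10% stay unchanged.
    Non-target positions get label -100 so cross-entropy ignores them."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100

    # 80% of selected positions -> [MASK]
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[to_mask] = mask_token_id

    # Half of the remaining selected positions (10% overall) -> random token
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    return input_ids, labels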
Large-scale Pre-training Scenario. Limited by time and computational resources, it is impossible for us to pre-train all models from the small-scale scenario from scratch in this setting. Thus, we pre-train the best model from the small-scale scenario and the baseline model BERT-reImp w/ PE in this part. We use a large amount of pre-training data (around 160GiB5). The batch size is set to 4096 and the number of pre-training steps is 300k. We pre-train each model with 8 NVIDIA A100 40GB GPUs, costing about 15 days per model. Hyper-parameter details are also given in the Appendix.
Fine-tuning. To evaluate the language understanding ability of our models, we fine-tune them on 8 tasks of the GLUE benchmark (Wang et al., 2019b). Overall, our proposed models achieve lower validation PPL scores and higher overall scores on the downstream tasks.
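As a rough illustration of this fine-tuning setup, the sketch below places a generic classification head on the [CLS] representation (class name, defaults, and the assumption that the encoder returns per-token hidden states are ours, not the paper's code):

import torch.nn as nn

class SentenceClassifier(nn.Module):
    """A linear classification head on top of the [CLS] representation."""
    def __init__(self, encoder, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder                       # pre-trained DecBERT / BERT body
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask)   # (batch, seq_len, hidden_size)
        cls = hidden[:, 0]                                  # representation of the [CLS] token
        return self.classifier(self.dropout(cls))           # (batch, num_labels)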

Small-scale Pre-training
Can DecBERT without any position embeddings still understand language well? Since the self-attention of the Transformer Encoder is order-invariant, extra position information is indispensable for it to model language; otherwise, it simply becomes a bag-of-words model. Table 2 indicates that DecBERT-Same/Diff w/o PE retains the same level of performance as BERT-reImp w/ PE. These results reveal that DecBERT can still understand language well without the help of position embeddings, which is in line with our experimental results in section 3.1.
Can DecBERT with position embeddings outperform BERT? Table 2 shows that both DecBERT-Same and DecBERT-Diff have lower validation PPL scores than BERT-reImp (w/ PE). After fine-tuning on the downstream tasks, Table 3 reveals that they also perform better on most tasks. These results confirm that our models benefit from the causal attention masks: such masks enhance the position encoding ability of BERT, leading to better language understanding ability.

Is using different directional causal attention masks helpful? The only difference between DecBERT-Same and DecBERT-Diff is that the latter adopts a causal attention mask of the opposite direction in the second layer. Table 2 shows that DecBERT-Diff w/ PE achieves the lowest validation PPL score (4.07). After fine-tuning on the downstream tasks, it also has the best overall score. These results confirm that DecBERT benefits from using different directional attention masks. Though each of the first two layers of DecBERT-Diff only considers a one-sided information flow, the model learns to process different directional information in these layers. This design maintains a better balance between the gain in position encoding ability and the loss of language information.
Why can DecBERT benefit from the causal attention masks? The experimental results in the previous part indicate that the causal attention masks increase the model's position encoding ability, which in turn leads to better language understanding ability. However, the relation between these two abilities remains unclear. We analyze the pre-training process of our models to give a possible explanation. The pre-training loss curves are presented in Figures 2 and 3. Since the randomly initialized multi-head self-attention of BERT is a "balanced" structure without any inductive bias, the model needs to learn suitable position embeddings to trace the word order during pre-training. In Figure 2, one can notice that the pre-training process of BERT-reImp w/ PE can be divided into four stages: (1) starting stage (0-1000 steps), (2) plateau stage (1000-8000 steps), (3) "diving" stage (8000-10000 steps) and (4) convergence stage (10000 steps to the end). In the starting and plateau stages, BERT-reImp w/ PE has almost the same training loss as its counterpart without PE, which indicates that it is still a bag-of-words model and does not know how to make use of the position information. In the "diving" stage, the training loss of BERT-reImp w/ PE decreases rapidly, while BERT-reImp w/o PE starts to converge. This reveals that word order information becomes more useful for the model to understand language in this stage. In the convergence stage, the training loss decreases slowly until the end of the whole pre-training process.
So, how do the causal attention masks affect the pre-training process? The first two layers of DecBERT break the "balance" of the multi-head self-attention by design. The position bias from the attention masks makes the first two layers sensitive to word order information. In Figure 2, one can notice that the plateau stage of DecBERT is shortened (from around 7000 to 3000 steps). This reveals that DecBERT does not need to spend as much time as BERT learning to make use of the position information: it escapes the bag-of-words sub-optimal point faster. Though the gap between BERT-reImp w/ PE and DecBERT-Diff w/ PE becomes smaller in the convergence stage, Figure 3 indicates that DecBERT-Diff w/ PE still has lower training loss throughout the whole pre-training process.

Large-scale Pre-training
In the large-scale pre-training scenario, we intend to verify whether our modification still achieves better performance. Figure 4 shows trends consistent with the small-scale pre-training scenario. For the validation PPL, DecBERT-Diff achieves lower scores than BERT-reImp throughout the whole pre-training process. In particular, at the 13th epoch (265k steps), the validation PPL of DecBERT-Diff is 3.48, the same as that of BERT-reImp at the 15th epoch (300k steps). This suggests that the pre-training process of DecBERT-Diff is about 2 epochs faster than BERT-reImp. Combining this with our previous analysis, one advantage of our modification is that it accelerates the pre-training process. Comparing the downstream tasks, one can also notice that the performance gap between DecBERT-Diff and BERT-reImp becomes even larger: the average score is 1.2 points higher.
All results in this part indicate that our modification is effective not only in small-scale pre-training but also in large-scale pre-training. It accelerates the pre-training process, and when pre-training with the same amount of computational resources, it achieves better performance on masked language modeling and downstream tasks.

Discussion
The analysis and experimental results detailed in the previous sections point out an interesting finding: the pre-training process of BERT can be divided into different stages. A similar phenomenon can also be found in the work of Kovaleva et al. (2021). They find that both the scaling factors and the biases of Layer Normalization begin to diverge from their initialization values quickly in the "diving" stage. In particular, one or two specific bias neurons take on larger and larger absolute values. Luo et al. (2021) indicate that such neurons are highly related to positional information. These findings complement our possible explanation that in the plateau stage the model needs to learn suitable position embeddings, and in the "diving" stage it learns to adopt such embeddings to better model language. Our DecBERT models indicate that breaking the "balance" by design can help BERT better capture the position information, which leads to better performance.
One might wonder about the fixed sinusoidal position embeddings. With such embeddings, BERT does not need to learn suitable position embeddings during pre-training, so based on our previous analysis, the plateau stage may disappear. To examine whether such position embeddings are better, we conduct an extra small-scale pre-training experiment. The pre-training loss curve is shown in Figure 5, revealing that the plateau stage indeed disappears, in line with our previous results. However, in the convergence stage, we find that BERT with the sinusoidal PE has higher pre-training loss than with the learnable PE. This indicates that the learnable position embeddings are more suitable for BERT.

Related Work
Previous works (Vaswani et al., 2017; Shaw et al., 2018; Huang et al., 2019; Dai et al., 2019) indicate that the self-attention mechanism of the Transformer Encoder is permutation equivariant, so it needs position embeddings. Tsai et al. (2019) have proved that the Decoder's self-attention is not permutation equivariant, indicating that the Decoder is not a bag-of-words model like the Encoder, but they do not conduct further analysis on the Decoder's position encoding ability. Apart from the analysis, Irie et al. (2019) train Transformer language models on a speech dataset and find that models without position embeddings achieve lower perplexity scores. Schlag et al. (2021a) introduce a new Linear Transformer language model with fast weight memories (Schmidhuber, 1992; Schlag et al., 2021b), which has lower perplexity without position encodings on the WikiText-103 dataset.
Furthermore, an explosion of work focuses on proposing better methods to add position information into pre-trained language models. Dufter et al. (2021) give a comprehensive introduction to different position encoding methods for the Transformer and divide them into three approaches. One line of work adds position embeddings to the input before it is fed to the actual Transformer model (Vaswani et al., 2017; Shaw et al., 2018; Devlin et al., 2019; Kitaev et al., 2020; Press et al., 2020; Wang et al., 2020). A second line of work directly modifies the attention matrix (Dai et al., 2019; Dufter et al., 2020; He et al., 2020; Wu et al., 2021a; Ke et al., 2021; Su et al., 2021). The last one combines the first two approaches. However, all of these focus on introducing an extra set of parameters to trace the word order, whereas our work makes use of the causal attention masks.
Most similar to our modification in Section 3.2, Im and Cho (2017) propose a self-attention based model which achieves better performance on the SNLI task (Bowman et al., 2015) without the help of explicit position encodings. However, their model differs from the standard Transformer and uses extra local attention masks to control the information flow. With the popularity of the Transformer model in the Computer Vision field, some works propose different methods to make Vision Transformers aware of word order implicitly (Chu et al., 2021; Yuan et al., 2021; Wu et al., 2021b), but all of them modify the models with convolutional neural networks (LeCun et al., 1998).

Conclusion
In this work, we introduce a new pre-trained model, called DecBERT, adopting causal attention masks to enhance the language understanding of BERT. We conduct a series of experiments to verify the effectiveness of our models. Experimental results indicate that our proposed models achieve better performance than BERT on most downstream tasks when pre-training with the same amount of data and computational resources. Moreover, our analysis also indicates that our models can accelerate the pre-training process.