GhostBERT: Generate More Features with Cheap Operations for BERT

Transformer-based pre-trained language models like BERT, though powerful in many tasks, are expensive in both memory and computation, due to their large number of parameters. Previous works show that some parameters in these models can be pruned away without severe accuracy drop. However, these redundant features contribute to a comprehensive understanding of the training data and removing them weakens the model’s representation ability. In this paper, we propose GhostBERT, which generates more features with very cheap operations from the remaining features. In this way, GhostBERT has similar memory and computational cost as the pruned model, but enjoys much larger representation power. The proposed ghost module can also be applied to unpruned BERT models to enhance their performance with negligible additional parameters and computation. Empirical results on the GLUE benchmark on three backbone models (i.e., BERT, RoBERTa and ELECTRA) verify the efficacy of our proposed method.

Previous works show that there are redundant features in the BERT model, and that unimportant attention heads or neurons can be pruned away without severe accuracy degradation (Michel et al., 2019). However, for computer vision (CV) tasks, it has been shown that redundant features in convolutional neural networks also contribute positively to performance, and that using cheap linear operations to generate more "ghost" feature maps enhances performance with few additional parameters. On the other hand, it is shown in (Voita et al., 2019; Kovaleva et al., 2019; Rogers et al., 2020) that many attention maps in pre-trained language models exhibit typical positional patterns, e.g., diagonal or vertical, which can be easily generated from other similar ones using operations like convolution.
Based on these two observations, in this paper we propose to use cheap ghost modules on top of the remaining important attention heads and neurons to generate more features, so as to compensate for the pruned ones. Considering that the convolution operation (1) encodes local context dependency, complementing the global self-attention in Transformer models (Wu et al., 2020), and (2) can generate some BERT features, like positional attention maps, from similar others, we use the efficient 1-Dimensional Depthwise Separable Convolution (Wu et al., 2019) as the basic operation in the ghost module. To keep the generated ghost features on a similar scale as the original ones, we use a softmax function to normalize the convolution kernel.
Afterwards, we fine-tune the parameters of both the BERT backbone model and the added ghost modules. Note that the ghost modules need not be applied only to pruned models: they can also be applied directly to pre-trained language models for better performance, with negligible additional parameters and floating-point operations (FLOPs). Figure 1 summarizes the average accuracy versus parameter size and FLOPs on the GLUE benchmark, where adding ghost modules to both the unpruned (m = 12/12) and pruned (m < 1) BERT models outperforms the counterparts without ghost modules. Further experiments on the GLUE benchmark show that with only 0.4% more parameters and 0.9% more FLOPs, the proposed ghost modules improve the average accuracy of BERT-base, RoBERTa-base and ELECTRA-small by 0.9, 0.6 and 2.4 points, respectively. When ghost modules are applied to small or pruned models, the resulting models outperform other BERT compression methods.

Approach
In this section, we first introduce where to add ghost modules in a BERT model (Section 2.1), and then discuss the components and optimization details of the ghost module (Section 2.2).

Adding Ghost Modules to BERT
The BERT model is built from Transformer layers, each of which contains a Multi-Head Attention (MHA) layer and a Feed-Forward Network (FFN), together with skip connections and layer normalizations. Prior work shows that the computations for the attention heads of MHA and for the neurons in the intermediate layer of FFN can be performed in parallel, so the BERT model can be compressed in a structured manner by pruning the parameters associated with these heads and neurons. In this paper, after pruning the unimportant heads and neurons, we place cheap ghost modules on top of the remaining ones to generate more ghost features that compensate for the pruned ones. A rough sketch of structured head pruning is given below.
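To make the structured pruning concrete, the following minimal PyTorch sketch (not the authors' released code; all names and shapes here are illustrative assumptions) shows how keeping a subset of heads simply removes the corresponding slices of the query/key/value projections and of the output projection:

```python
import torch

def prune_heads(W_q, W_k, W_v, W_o, keep, d_h):
    """Keep only the attention heads listed in `keep`.

    W_q, W_k, W_v: (d, N_H * d_h) projection weights, one d_h-wide slice per head.
    W_o:           (N_H * d_h, d) output projection, one d_h-tall slice per head.
    """
    cols = torch.cat([torch.arange(h * d_h, (h + 1) * d_h) for h in keep])
    return W_q[:, cols], W_k[:, cols], W_v[:, cols], W_o[cols, :]

# Example: keep 3 of 12 heads (width multiplier m = 3/12) in a BERT-base-sized layer.
d, N_H, d_h = 768, 12, 64
W_q = W_k = W_v = torch.randn(d, N_H * d_h)
W_o = torch.randn(N_H * d_h, d)
W_q, W_k, W_v, W_o = prune_heads(W_q, W_k, W_v, W_o, keep=[0, 3, 7], d_h=d_h)
print(W_q.shape, W_o.shape)  # torch.Size([768, 192]) torch.Size([192, 768])
```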
For simplicity of notation, we omit the bias terms in linear and convolution operations where applicable in the rest of this work.

Ghost Module on MHA
Following prior work on head pruning (Michel et al., 2019), we divide the computation of MHA into the computation of each attention head. Specifically, suppose the sequence length and hidden state size are $n$ and $d$, respectively, and each Transformer layer contains $N_H$ attention heads. For an input matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, the $h$-th attention head computes its output as
$$\mathbf{H}_h = \mathrm{Attention}(\mathbf{X}\mathbf{W}_h^Q, \mathbf{X}\mathbf{W}_h^K, \mathbf{X}\mathbf{W}_h^V)\,\mathbf{W}_h^O,$$
where $\mathbf{W}_h^Q, \mathbf{W}_h^K, \mathbf{W}_h^V \in \mathbb{R}^{d \times d_h}$ and $\mathbf{W}_h^O \in \mathbb{R}^{d_h \times d}$ are the projection matrices associated with it. In multi-head attention, the $N_H$ heads are computed in parallel to get the final output:
$$\mathrm{MHA}(\mathbf{X}) = \sum_{h=1}^{N_H} \mathbf{H}_h. \quad (1)$$
Given a width multiplier $m \le 1$, we keep $M = N_H m$ heads and use them to generate $F$ ghost features. The $f$-th ghost feature is generated as
$$\mathrm{Ghost}_f(\mathbf{X}) = \mathrm{ReLU}\Big(\sum_{h=1}^{M} G_{f,h}(\mathbf{H}_h)\Big), \quad (2)$$
where $G_{f,h}$ is the proposed cheap ghost module that generates features from the $h$-th attention head's representation for the $f$-th ghost feature, and ReLU is used as the nonlinearity. Thus the computation of MHA in GhostBERT is
$$\mathrm{GhostMHA}(\mathbf{X}) = \sum_{h=1}^{M} \mathbf{H}_h + \sum_{f=1}^{F} \mathrm{Ghost}_f(\mathbf{X}). \quad (3)$$
Besides being added to the output of MHA, the ghost modules can also be added to other positions in MHA; detailed discussions are in Section 4.2.
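The GhostMHA computation in Equations (1)-(3) can be summarized in a short sketch. This is an illustrative reimplementation under our own naming (ghost_mha, head_outputs, ghost_modules are assumed names), not the released code:

```python
import torch
import torch.nn.functional as F

def ghost_mha(head_outputs, ghost_modules):
    """Combine the M remaining head outputs with F ghost features (cf. Eqs. (1)-(3)).

    head_outputs : list of M tensors, each (n, d) -- the per-head outputs H_h.
    ghost_modules: F x M nested list of callables G[f][h] mapping (n, d) -> (n, d).
    """
    mha_out = sum(head_outputs)                        # Eq. (1) restricted to the kept heads
    ghosts = [F.relu(sum(G_f[h](head_outputs[h])       # Eq. (2): f-th ghost feature
                         for h in range(len(head_outputs))))
              for G_f in ghost_modules]
    return mha_out + sum(ghosts)                       # Eq. (3): GhostMHA output

# Toy usage: n = 4 tokens, d = 8, M = 2 kept heads, F = 1 ghost feature.
heads = [torch.randn(4, 8), torch.randn(4, 8)]
identity = lambda x: x                                 # stand-in for a cheap ghost module
print(ghost_mha(heads, ghost_modules=[[identity, identity]]).shape)  # torch.Size([4, 8])
```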

Ghost Module on FFN
Similar to the attention heads in MHA, the computation of FFN can also be divided into computations for each neuron in the intermediate layer of FFN. With a slight abuse of notation, we still use $\mathbf{X} \in \mathbb{R}^{n \times d}$ as the input to FFN. Denoting the number of neurons in the intermediate layer as $d_{ff}$, the computation of FFN can be written as
$$\mathrm{FFN}(\mathbf{X}) = \sigma(\mathbf{X}\mathbf{W}_1)\,\mathbf{W}_2,$$
where $\mathbf{W}_1 \in \mathbb{R}^{d \times d_{ff}}$, $\mathbf{W}_2 \in \mathbb{R}^{d_{ff} \times d}$, and $\sigma$ is the activation function. For simplicity, we use the same width multiplier $m$ for FFN as for MHA, and divide the neurons into $N_H$ folds, where each fold contains $d_f = d_{ff}/N_H$ neurons. For the $h$-th fold, its output is
$$\mathbf{F}_h = \sigma(\mathbf{X}\mathbf{W}_1^{(h)})\,\mathbf{W}_2^{(h)},$$
where $\mathbf{W}_1^{(h)} \in \mathbb{R}^{d \times d_f}$ and $\mathbf{W}_2^{(h)} \in \mathbb{R}^{d_f \times d}$ are the parameters associated with it. In FFN, the $N_H$ folds are computed in parallel to get the output:
$$\mathrm{FFN}(\mathbf{X}) = \sum_{h=1}^{N_H} \mathbf{F}_h. \quad (4)$$
For width multiplier $m$, we keep $M$ folds of neurons and use ghost modules to generate $F$ ghost features as in Equation (2). Thus the computation of FFN in GhostBERT can be written as
$$\mathrm{GhostFFN}(\mathbf{X}) = \sum_{h=1}^{M} \mathbf{F}_h + \sum_{f=1}^{F} \mathrm{Ghost}_f(\mathbf{X}). \quad (5)$$
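Analogously, a minimal sketch of the GhostFFN computation in Equations (4)-(5), again with assumed names (ghost_ffn, W1_folds, W2_folds) rather than the authors' implementation:

```python
import torch
import torch.nn.functional as F

def ghost_ffn(x, W1_folds, W2_folds, ghost_modules):
    """FFN computed fold-by-fold, plus ghost features (cf. Eqs. (4)-(5)).

    x        : (n, d) input.
    W1_folds : list of M tensors (d, d_f)  -- first-layer weights of the kept folds.
    W2_folds : list of M tensors (d_f, d)  -- second-layer weights of the kept folds.
    """
    folds = [F.gelu(x @ W1) @ W2                                      # per-fold outputs (GELU as in BERT)
             for W1, W2 in zip(W1_folds, W2_folds)]
    ffn_out = sum(folds)                                              # Eq. (4) on the kept folds
    ghosts = [F.relu(sum(G_f[h](folds[h]) for h in range(len(folds))))  # ghost features as in Eq. (2)
              for G_f in ghost_modules]
    return ffn_out + sum(ghosts)                                      # Eq. (5)

# Toy usage: d = 8, d_f = 16 per fold, M = 2 kept folds, F = 1 ghost feature.
x = torch.randn(4, 8)
W1s = [torch.randn(8, 16) for _ in range(2)]
W2s = [torch.randn(16, 8) for _ in range(2)]
print(ghost_ffn(x, W1s, W2s, ghost_modules=[[lambda t: t, lambda t: t]]).shape)  # torch.Size([4, 8])
```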

Ghost Module
In the previous section, we discussed where we insert the ghost modules in the Transformer layer. In this section, we elaborate on the components and normalization of the ghost modules.
Generally speaking, any function can be used as the ghost module G in Equation (2). Considering that (i) the convolution operation encodes local context dependency, complementing the global self-attention (Wu et al., 2020), and (ii) features like diagonal or vertical attention maps (Kovaleva et al., 2019; Rogers et al., 2020) can be easily generated by convolving similar others, we use convolution as the basic operation in the ghost module.

Convolution Type
With a slight abuse of notation, here we still use $\mathbf{X} \in \mathbb{R}^{n \times d}$ as the input to the convolution, i.e., the output $\mathbf{H}_h$ of the $h$-th head in MHA or of the $h$-th fold of neurons in FFN. Denote $\mathbf{O} \in \mathbb{R}^{n \times d}$ as the output of the convolution in the ghost module.
1-Dimensional convolution (Conv1D) over the sequence direction encodes local dependency over contexts and has shown remarkable performance on NLP tasks (Wu et al., 2019, 2020). To exploit the representation power of Conv1D without too much additional memory and computation, we choose 1-Dimensional Depthwise Separable Convolution (DWConv) (Wu et al., 2019) for the ghost module. Compared with Conv1D, DWConv performs the convolution independently over every channel, reducing the number of parameters from $d^2 k$ to $dk$ (where $k$ is the convolution kernel size). Denote the weight of the DWConv operation as $\mathbf{W} \in \mathbb{R}^{d \times k}$. After applying DWConv, the output for the $i$-th token and $c$-th channel can be written as
$$\mathbf{O}_{i,c} = \sum_{j=1}^{k} \mathbf{W}_{c,j}\, \mathbf{X}_{\,i + j - \lceil (k+1)/2 \rceil,\; c}.$$

Normalization
Since the parameters of the BERT backbone model and of the ghost modules can have quite different scales and optimization behaviors, we use a softmax function to normalize each convolution kernel $\mathbf{W}_{c,:}$ along the kernel (sequence) dimension as $\mathrm{Softmax}(\mathbf{W}_{c,:})$ before convolution, following Wu et al. (2019). With softmax normalization, the weights in each kernel sum to 1, ensuring that the convolved output has a similar scale as the input. Thus, after applying the ghost module, the output for the $i$-th token and $c$-th channel can be written as
$$\mathbf{O}_{i,c} = \sum_{j=1}^{k} \mathrm{Softmax}(\mathbf{W}_{c,:})_j\, \mathbf{X}_{\,i + j - \lceil (k+1)/2 \rceil,\; c}.$$
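Putting the two pieces together, a softmax-normalized depthwise-convolution ghost module can be sketched in a few lines of PyTorch. The class below is an illustrative assumption of how such a module might look (the paper does not give this exact code); it uses only $dk$ parameters per module, matching the analysis above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GhostModule(nn.Module):
    """Softmax-normalized 1-D depthwise convolution over the sequence direction.

    A minimal sketch: the kernel W has shape (d, k) -- one k-tap filter per channel --
    and each row is softmax-normalized so the convolved output keeps the input's scale.
    """
    def __init__(self, d, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, k))
        self.k = k

    def forward(self, x):                            # x: (n, d)
        w = F.softmax(self.weight, dim=-1)           # normalize each kernel W_{c,:}
        x = x.t().unsqueeze(0)                       # -> (1, d, n) for conv1d
        out = F.conv1d(x, w.unsqueeze(1),            # depthwise: groups = d
                       padding=self.k // 2, groups=w.shape[0])
        return out.squeeze(0).t()                    # back to (n, d)

ghost = GhostModule(d=768, k=3)
print(ghost(torch.randn(128, 768)).shape)  # torch.Size([128, 768])
```

With $d = 768$ and $k = 3$, one such module has 2,304 parameters; with parameter sharing and one module each for MHA and FFN per layer, 12-layer BERT-base gains roughly 55K parameters, consistent with the numbers reported in Section 3.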

Training Details
Table 1: Development set results of the baseline pre-trained language models and our proposed method on the GLUE benchmark. Both pruned and unpruned BERT-base (resp. RoBERTa-base) are used as the backbone models for GhostBERT (resp. GhostRoBERTa). The unpruned ELECTRA-small is used as the backbone model for GhostELECTRA-small. m is the width multiplier, written as a fraction whose numerator and denominator are the number of remaining attention heads/folds of neurons and the total number of heads/folds, respectively.

To turn a pre-trained BERT model into a smaller-sized GhostBERT, we perform the following three steps:
Pruning. For a given width multiplier m, we prune attention heads in MHA and neurons in the intermediate layer of FFN from a pre-trained BERT-based model following existing structured-pruning practice.
Distillation. We then add ghost modules to the pruned model as in Section 2.1. Suppose there are $L$ Transformer layers. We distill knowledge from the embedding-layer output $\mathbf{E}$, the hidden states $\mathbf{M}_l$ after MHA and $\mathbf{F}_l$ after FFN (where $l = 1, 2, \cdots, L$) of the full-sized teacher model to the corresponding $\mathbf{E}^m$, $\mathbf{M}_l^m$, $\mathbf{F}_l^m$ of the student GhostBERT. Following Jiao et al. (2020), we use augmented data for distillation. Denoting MSE as the mean squared error, the three loss terms are
$$\ell_{emb} = \mathrm{MSE}(\mathbf{E}^m, \mathbf{E}), \quad \ell_{mha} = \sum_{l=1}^{L} \mathrm{MSE}(\mathbf{M}_l^m, \mathbf{M}_l), \quad \ell_{ffn} = \sum_{l=1}^{L} \mathrm{MSE}(\mathbf{F}_l^m, \mathbf{F}_l),$$
and the distillation loss is their sum:
$$\mathcal{L}_{distill} = \ell_{emb} + \ell_{mha} + \ell_{ffn}.$$
Fine-tuning. Denoting the predicted logits as $y$ and the ground-truth labels as $\hat{y}$, we finally fine-tune the GhostBERT with the cross-entropy loss $\mathcal{L}_{ft} = \mathrm{CE}(y, \hat{y})$.
Note that, instead of being applied to pruned models, the cheap ghost modules can also be applied directly to a pre-trained model for better performance with negligible additional parameters and FLOPs. In this case, the training procedure contains only the distillation and fine-tuning steps; a sketch of the distillation loss is given below.
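For concreteness, the sketch below shows one possible implementation of the layer-wise distillation loss, assuming the three terms are summed with equal weights (an assumption on our part) and that teacher and student hidden states are collected into dictionaries with assumed keys:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher):
    """Layer-wise MSE distillation loss (an illustrative sketch, not the released code).

    `student` and `teacher` are dicts with the embedding output 'emb' of shape (n, d) and
    lists 'mha' and 'ffn' of L hidden states (n, d) taken after each MHA and FFN block.
    """
    loss = F.mse_loss(student["emb"], teacher["emb"])          # embedding term
    for s, t in zip(student["mha"], teacher["mha"]):
        loss = loss + F.mse_loss(s, t)                         # MHA hidden-state terms
    for s, t in zip(student["ffn"], teacher["ffn"]):
        loss = loss + F.mse_loss(s, t)                         # FFN hidden-state terms
    return loss
```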
Empirically, to save memory and computation, we generate one ghost feature for each MHA and FFN (i.e., F = 1 in Equations (3) and (5)) and let all ghost modules G_{f,h} share the same parameters. As will be shown in Section 3, adding these simplified ghost modules already achieves a clear performance gain.

Experiment
In this section, we show the efficacy of the proposed method with (pruned) BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020) as backbone models.

Setup
Experiments are performed on the GLUE benchmark (Wang et al., 2019), which consists of various natural language understanding tasks; more statistics about the GLUE datasets are in Appendix A.1. Following Clark et al. (2020), we report Spearman correlation for STS-B, Matthews correlation for CoLA and accuracy for the other tasks. For MNLI, we report results on the matched section. The convolution kernel size in the ghost module is set to 3 unless otherwise stated, and the detailed hyperparameters for training GhostBERT are in Appendix A.2. The model with the best development set performance is used for testing. For each method, we also report the number of parameters and FLOPs at inference (details can be found in Appendix A.3). We compare our proposed method against (i) baseline pre-trained language models: BERT-base (Devlin et al., 2019), RoBERTa-base (Liu et al., 2019) and ELECTRA-small (Clark et al., 2020); and (ii) BERT compression methods: TinyBERT (Jiao et al., 2020), ConvBERT, and MobileBERT (Sun et al., 2020). The development set results of RoBERTa-base and the test set results of ELECTRA, BERT-base and ConvBERT are taken from previously reported numbers; the others are from their original papers or repositories.

Ghost Modules on Unpruned Models. Table 1 shows the GLUE development set results of the baseline pre-trained language models and our proposed method. When the cheap ghost modules are applied directly to these unpruned pre-trained models, better performance is achieved with only negligible additional parameters and FLOPs. Specifically, adding ghost modules to BERT-base, RoBERTa-base and ELECTRA-small increases the average development accuracy by 0.9, 0.6 and 2.4 points, respectively, with only 55.3K more parameters and 14.2M more FLOPs. On the test set, the average performance gains are 0.8, 1.1 and 2.4 points.
Comparison with Other Compression Methods. Table 2 compares the proposed method with other popular BERT compression methods. Under similar parameter sizes or FLOPs, the proposed GhostBERT performs comparably to the other BERT compression methods, while GhostRoBERTa often outperforms them. In particular, GhostELECTRA-small achieves over 1.5 points higher accuracy than other similar-sized small models such as ELECTRA-small, TinyBERT$_4$ and ConvBERT-small.
In Table 3 and Figure 1, we also compare the pruned BERT with and without ghost modules. For a fair comparison, the pruned model without ghost modules is trained with the same procedure as in Section 2.3. As can be seen, adding the ghost modules yields considerable improvement with negligible additional memory and computation.

Ablation Study
In this section, we perform an ablation study on (i) the training procedure, including data augmentation (DA) and knowledge distillation (KD); and (ii) the ghost module, including the convolution kernel size, the softmax normalization of the convolution kernel, and the nonlinearity applied to each ghost feature in Equation (2).
Training Procedure. Table 4 verifies the effectiveness of Data Augmentation (DA) and Knowledge Distillation (KD) for the GhostBERT model with width multiplier m ∈ {3/12, 1/12}. Without DA and KD, GhostBERT incurs a severe accuracy drop of 3.5 and 6.4 points on average, for m = 3/12 and 1/12, respectively.
Ghost Module. Table 4 also shows the effectiveness of the softmax normalization of the convolution kernel and of the ReLU nonlinearity in Equation (2). Dropping the softmax normalization or the ReLU nonlinearity reduces the average accuracy by 0.8 and 1.6 points respectively for m = 3/12, and by 0.9 and 2.2 points respectively for m = 1/12. Further, we explore whether the kernel size of the DWConv in the ghost module plays an important role. Figure 3 shows the results of GhostBERT with width multipliers m ∈ {3/12, 1/12} and various convolution kernel sizes; the average accuracy over five tasks is reported, and detailed results for each task can be found in Table 9 in Appendix B.1. The performance of GhostBERT first increases and then gradually decreases as the kernel size grows. For both width multipliers, kernel size 3 performs best and is used as the default kernel size in other experiments unless otherwise stated.

Discussion
In this section, we discuss different choices of the convolution type used in the ghost module (Section 4.1) and of where to place the ghost modules in a BERT model (Section 4.2).

Ghost Module Types
Besides the DWConv in Section 2.2, in this section we discuss more options for the convolution in the ghost module. We follow the notation in Section 2.2 and denote the input, weight and kernel size of the convolution as $\mathbf{X}$, $\mathbf{W}$ and $k$, respectively.
1-Dimensional Convolution. If the kernel convolves the input over the sequence direction (abbreviated as Conv1D$_S$), the numbers of input and output channels are both $d$, and the weight has shape $\mathbf{W} \in \mathbb{R}^{d \times d \times k}$. After applying Conv1D$_S$, the output for the $i$-th token and $c$-th channel is
$$\mathbf{O}_{i,c} = \sum_{c'=1}^{d} \sum_{j=1}^{k} \mathbf{W}_{c, c', j}\, \mathbf{X}_{\,i + j - \lceil (k+1)/2 \rceil,\; c'}.$$
If the kernel convolves the input over the feature direction (abbreviated as Conv1D$_F$), the numbers of input and output channels are both $n$, and the weight has shape $\mathbf{W} \in \mathbb{R}^{n \times n \times k}$. After applying Conv1D$_F$, the output for the $i$-th token and $c$-th channel is
$$\mathbf{O}_{i,c} = \sum_{i'=1}^{n} \sum_{j=1}^{k} \mathbf{W}_{i, i', j}\, \mathbf{X}_{\,i',\; c + j - \lceil (k+1)/2 \rceil}.$$

2-Dimensional Convolution (Conv2D). For Conv2D, the numbers of input and output channels are both 1, and thus the weight has shape $\mathbf{W} \in \mathbb{R}^{1 \times 1 \times k \times k}$. After applying Conv2D, the output for the $i$-th token and $c$-th channel is
$$\mathbf{O}_{i,c} = \sum_{j_1=1}^{k} \sum_{j_2=1}^{k} \mathbf{W}_{1, 1, j_1, j_2}\, \mathbf{X}_{\,i + j_1 - \lceil (k+1)/2 \rceil,\; c + j_2 - \lceil (k+1)/2 \rceil}.$$

Comparison of Different Convolutions

Table 5 compares the different convolutions used in the ghost module. Among the 1-dimensional convolutions, Conv1D$_S$ performs better than Conv1D$_F$, possibly because convolving over the sequence direction urges the model to learn dependencies among tokens. Though 2-dimensional convolution (Conv2D) is quite successful in CV tasks, it performs much worse than Conv1D$_S$ here. This may be because the two dimensions of feature maps in CV tasks encode similar information, while those of hidden states in Transformers encode quite different information (i.e., feature and sequence). Thus Conv2D results in worse performance than Conv1D$_S$, even though it requires far fewer parameters and FLOPs. On the other hand, DWConv achieves performance comparable to Conv1D$_S$ while being much more efficient in terms of parameters and FLOPs, since it performs the convolution independently over every feature dimension.
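To make the efficiency comparison concrete, the short calculation below tallies parameter counts and rough FLOPs for the four convolution types under the shapes defined above (one multiply-accumulate counted as two FLOPs; the sizes n = 128, d = 768, k = 3 and the resulting numbers are illustrative, not those reported in Table 5):

```python
# Rough parameter and FLOP counts for one ghost module of each convolution type.
n, d, k = 128, 768, 3

params = {
    "Conv1D_S": d * d * k,      # weight (d, d, k)
    "Conv1D_F": n * n * k,      # weight (n, n, k)
    "Conv2D":   k * k,          # weight (1, 1, k, k)
    "DWConv":   d * k,          # weight (d, k)
}
flops = {
    "Conv1D_S": 2 * n * d * d * k,   # each of the n*d outputs needs d*k MACs
    "Conv1D_F": 2 * n * n * d * k,   # each output needs n*k MACs
    "Conv2D":   2 * n * d * k * k,   # each output needs k*k MACs
    "DWConv":   2 * n * d * k,       # each output needs k MACs
}
for name in params:
    print(f"{name:9s} params={params[name]:>10,d}  FLOPs={flops[name]:>12,d}")
```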

Ghost Module Positions
In this section, we explore more possible positions for adding the ghost module. For MHA, besides adding the ghost module after the projection layer (After O in Figure 4(c)) as in Section 2.1.1, we can also add it right after calculating the attention scores (After QK in Figure 4(a)), or after multiplying the attention scores with the value layer (After V in Figure 4(b)). For FFN, besides adding the ghost module after the second linear layer (After FFN2 in Figure 4(e)) as in Section 2.1.2, we can also add it after the intermediate layer (After FFN1 in Figure 4(d)). Note that we use Conv2D as the ghost module for After QK because the attention map encodes attention probabilities in both dimensions. For After QK and After V, to match the dimensions of the other parameters, the numbers of input and output channels are $M$ and $N_H - M$, respectively. Table 6 shows the results of adding one ghost module at the same position in each Transformer layer. Adding the ghost module on the attention maps (After QK) performs best. However, since the parameters in the value and projection layers of MHA are left unpruned, After QK has many more parameters and FLOPs than the other positions. Adding ghost modules to the other four positions gives similar average accuracy. Thus, for MHA, we choose the most memory- and computation-efficient strategy, After O. Similarly, for FFN, we add ghost modules to the final output (After FFN2). From Table 6, our way of adding ghost modules has performance comparable to After QK, while being much more efficient in parameter size and FLOPs.

Related Work

In the depth direction, pruning Transformer layers via structured dropout is proposed in LayerDrop (Fan et al., 2020). Efficient selection of Transformer layers at inference via early exit has also been proposed (Xin et al., 2020). Other works perform structured pruning in both the width and depth directions, where the importance of attention heads and of neurons in the intermediate layer of the FFN is measured by their impact on the loss, and the least important heads and neurons are pruned away.

Enhanced Representation in Transformer-based Models

Various methods have been proposed to use linear or convolution operations to enhance the representation of Transformer layers. A first group of works replaces the self-attention mechanism or the feed-forward networks with simpler and more efficient convolution operations while maintaining comparable results. Wu et al. (2019) introduce token-based dynamic depthwise convolution to compute the importance of context elements, and achieve better results on various NLP tasks. Iandola et al. (2020) replace all the feed-forward networks with grouped convolutions. AdaBERT uses differentiable neural architecture search to find more efficient convolution-based NLP models. A second group uses linear or convolutional modules alongside the self-attention mechanism for more powerful representations. The new module can be incorporated through a serial connection to the original self-attention mechanism (Mehta et al., 2020), or used in parallel with it (Wu et al., 2020) to capture both local and global context dependency. Serial and parallel connections of such linear or convolution operations to Transformer layers have also been extended to multi-task (Houlsby et al., 2019; Stickland and Murray, 2019) and multilingual settings (Pfeiffer et al., 2020).
Note that the proposed ghost modules are orthogonal to the above methods in that these modules are used to generate more features for the Transformer models and can be easily integrated into existing methods to boost their performance.

Conclusion
In this paper, we propose GhostBERT to generate more features in pre-trained models with cheap operations. We use softmax-normalized 1-Dimensional Convolutions as ghost modules and add them to the outputs of the MHA and FFN of each Transformer layer. Empirical results on BERT, RoBERTa and ELECTRA demonstrate that adding the proposed ghost modules enhances the representation power and boosts the performance of the original model by supplying more features.

A.1 Statistics of GLUE datasets
The GLUE benchmark (Wang et al., 2019) consists of various sentence understanding tasks, including two single-sentence classification tasks (CoLA and SST-2), three similarity and paraphrase tasks (MRPC, STS-B and QQP), and four inference tasks (MNLI, QNLI, RTE and WNLI). For MNLI, we report results on the matched section. Winograd Schema (WNLI) is a small natural language inference dataset on which even a majority-class baseline outperforms many methods, and, as noted on the official GLUE website, there are issues with its construction. Like previous work, we therefore do not experiment on WNLI. We use the default train/development/test splits from the official website.


A.2 Hyperparameters
Table 8 lists the detailed hyperparameters for the distillation and fine-tuning stages (Section 2.3) of the proposed method on the GLUE benchmark.

A.3 FLOPs
Floating-point operations (FLOPs) measure the amount of computation required at inference. We follow common practice and report FLOPs with batch size 1 and sequence length 128. Since the operations in the embedding lookup are relatively cheap compared to those in the Transformer layers, following Sun et al. (2020), we do not count them. Note that the reported FLOPs for ELECTRA (Clark et al., 2020) and ConvBERT in their original papers include those for the embedding lookup and are therefore slightly different from the numbers in this paper.
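As a rough illustration of how such FLOP counts arise, the sketch below estimates the FLOPs of one Transformer layer from its matrix multiplications only (a multiply-accumulate counted as two FLOPs; layer norm, softmax and biases ignored). The function and its default sizes are our own assumptions for BERT-base, not the exact counting script used in the paper:

```python
def transformer_layer_flops(n=128, d=768, d_ff=3072):
    """Very rough FLOPs for one Transformer layer (one MAC counted as 2 FLOPs)."""
    proj = 4 * 2 * n * d * d      # Q, K, V and output projections
    attn = 2 * 2 * n * n * d      # QK^T scores and attention-weighted sum over values
    ffn = 2 * 2 * n * d * d_ff    # the two linear layers of the FFN
    return proj + attn + ffn

print(12 * transformer_layer_flops() / 1e9, "GFLOPs for 12 layers")  # roughly 22 GFLOPs
```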

B.1 Full Results of Different Convolution Kernel Sizes
In Table 9, we show the detailed results for different convolution kernel sizes on each of the five tasks (SST-2, MRPC, CoLA, STS-B and RTE). For every task, DWConv with kernel size 3 gives the best performance.

B.2 Full Results of Pruned BERT
In Table 10, we show the detailed results of the pruned BERT and the GhostBERT for each task.
We can see that, under the same training procedure, GhostBERT outperforms the pruned BERT across all compared sizes.

B.3 Full Results of Different Ghost Module Positions

Table 11 shows the detailed results of adding ghost modules to different positions of the model.

B.4 Generating More Features
As mentioned at the end of Section 2.3, we generate only one ghost feature for each MHA and FFN (i.e., $F = 1$) to save computation and memory. Our framework places no limitation on $F$, however, and also allows the model to generate more features ($F > 1$). In this section, we discuss the relationship between generating more ghost features and the computation/memory requirements. Following the notation in Section 2 and omitting the cheap computation of ReLU and softmax, generating $F$ ghost features from $M$ features for all $L$ layers requires $2LMFdk$ additional parameters and $4LMFndk$ additional FLOPs. Both scale linearly with $F$ and can be large when $F$ is large. For instance, for BERT-base with $d = 768$, when $n = 128$, $k = 3$, $M = 12$ and $F = 12$, the additional parameters and FLOPs are 8M and 2.0G respectively, accounting for 7.2% and 9.1% of the backbone model.
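The scaling above can be checked with a two-line calculation (values assumed for BERT-base as stated):

```python
# Sanity check of the parameter and FLOP scaling for BERT-base (assumed values).
L, M, F, d, n, k = 12, 12, 12, 768, 128, 3
extra_params = 2 * L * M * F * d * k
extra_flops = 4 * L * M * F * n * d * k
print(f"{extra_params / 1e6:.1f}M extra parameters")  # ~8.0M
print(f"{extra_flops / 1e9:.1f}G extra FLOPs")        # ~2.0G
```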
When $F$ increases, the accuracy of GhostBERT first increases slowly and soon begins to saturate or decrease. For example, for GhostBERT ($m = 1/12$), the average development accuracy on GLUE increases only from 80.3 to 80.6 as $F$ increases from 1 to 4, and then saturates for $F > 4$. For GhostBERT ($m = 3/12$), the highest accuracy of 83.1 is achieved at $F = 1$ or $2$, after which the accuracy begins to decrease.
Thus, in this paper, we simply choose $F = 1$, which is cheap but already achieves good performance on most tasks.