Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft labels as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of the student. For two kinds of student models and computing devices, the proposed KD paradigm gives rise to training speedup of 2.7×∼3.4×.


Introduction
* Work was done when Yuanxin Liu was an intern at Pattern Recognition Center, WeChat AI, Tencent Inc, China. † Zheng Lin is the corresponding author.

[Figure 1: … and TinyBERT (Jiao et al., 2020) on QNLI with the increase of HSK.]

Since the launch of BERT (Devlin et al., 2019), pre-trained language models (PLMs) have been advancing the state of the art (SOTA) in a wide range of NLP tasks. At the same time, the growing
size of PLMs has inspired a wave of research interest in model compression (Han et al., 2016) in the NLP community, which aims to facilitate the deployment of the powerful PLMs to resource-limited scenarios.
Knowledge distillation (KD) (Hinton et al., 2015) is an effective technique in model compression. In conventional KD, the student model is trained to imitate the teacher's prediction over classes, i.e., the soft labels. Subsequently, Romero et al. (2015) find that the intermediate representations in the teacher's hidden layers can also serve as a useful source of knowledge. As an initial attempt to introduce this idea to BERT compression, PKD (Sun et al., 2019) proposed to distill the representations of the [CLS] token in BERT's hidden layers, and later studies (Jiao et al., 2020; Sun et al., 2020; Hou et al., 2020) extend the distillation of hidden state knowledge (HSK) to all the tokens.
In contrast to the previous work that attempts to increase the amount of HSK, in this paper we explore in the opposite direction to "compress" HSK. We make the observation that although distilling HSK is helpful, the marginal utility diminishes quickly as the amount of HSK increases. To understand this effect, we conduct a series of analyses by compressing the HSK from three dimensions, namely depth, length and width (see Section 2.3 for a detailed description). We first compress each single dimension and compare a variety of strategies to extract crucial knowledge. Then, we jointly compress the three dimensions using a set of compression configurations, which specify the amount of HSK assigned to each dimension. Figure 1 shows the results on the QNLI dataset. We can find that 1) perceivable performance improvement can be obtained by extracting and distilling the crucial HSK, and 2) with only a tiny fraction of HSK the students can achieve the same performance as extensive HSK distillation.
Based on the second finding, we further propose an efficient paradigm to distill HSK. Concretely, we run BERT over the training set to obtain and store a subset of HSK. This can be done on cloud devices with sufficient computational capability. Given a target device with limited resource, we can compress BERT and select the amount of HSK accordingly. Then, the compressed model can perform KD on either the cloud or directly on the target device using the selected HSK and the original training data, dispensing with the need to load the teacher model.
In summary, our major contributions are: • We observe the marginal utility diminishing effect of HSK in BERT KD. To our knowledge, ours is the first attempt to systematically study knowledge compression in BERT KD.
• We conduct exploratory studies on how to extract the crucial knowledge in HSK, based on which we obtain perceivable improvements over a widely-used HSK distillation strategy.
• We propose an efficient KD paradigm based on the empirical findings. Experiments on the GLUE benchmark for NLU (Wang et al., 2019) show that the proposal gives rise to a training speedup of 2.7×∼3.4× for TinyBERT and ROSITA on GPU and CPU.¹

BERT Architecture
The backbone of BERT consists of an embedding layer and L identical Transformer (Vaswani et al., 2017) layers. The input to the embedding layer is a text sequence x tokenized by WordPiece (Wu et al., 2016). There are two special tokens in x: [CLS] is inserted in the left-most position to aggregate the sequence representation and [SEP] is used to separate text segments. By summing up the token embedding, the position embedding and the segment embedding, the embedding layer outputs a sequence of vectors $E = [e_1, \cdots, e_{|x|}] \in \mathbb{R}^{|x| \times d_H}$, where $d_H$ is the hidden size of the model.

¹ The code is available at https://github.com/llyx97/Marginal-Utility-Diminishes
Then, E passes through the stacked Transformer layers, which can be formulated as:

$H_l = \mathrm{Transformer}_l(H_{l-1}), \quad l \in [1, L] \quad (1)$

where $H_l = [h_{l,1}, \cdots, h_{l,|x|}] \in \mathbb{R}^{|x| \times d_H}$ is the output of the $l$-th layer and $H_0 = E$. Each Transformer layer is composed of two sub-layers: the multi-head self-attention layer and the feed-forward network (FFN). Each sub-layer is followed by a sequence of dropout (Srivastava et al., 2014), residual connection (He et al., 2016) and layer normalization (Ba et al., 2016). Finally, for NLU tasks, a task-specific classifier takes as input the representation of [CLS] in the $L$-th layer.
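The sub-layer ordering described above can be sketched in PyTorch. This is a minimal illustration, not BERT's actual implementation; the hidden sizes and class name are illustrative defaults.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Sketch of one BERT Transformer layer: each sub-layer
    (self-attention, FFN) is followed by dropout, a residual
    connection and layer normalization."""
    def __init__(self, d_hidden=768, n_heads=12, d_ffn=3072, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_hidden, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_hidden, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_hidden)
        )
        self.drop = nn.Dropout(p_drop)
        self.ln1 = nn.LayerNorm(d_hidden)
        self.ln2 = nn.LayerNorm(d_hidden)

    def forward(self, h):
        a, _ = self.attn(h, h, h)          # multi-head self-attention
        h = self.ln1(h + self.drop(a))     # dropout -> residual -> layer norm
        f = self.ffn(h)                    # feed-forward network
        return self.ln2(h + self.drop(f))  # dropout -> residual -> layer norm

layer = TransformerLayer()
x = torch.randn(2, 16, 768)                # (batch, |x|, d_H)
print(layer(x).shape)                      # shape is preserved layer to layer
```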

BERT Compression with KD
Knowledge distillation is a widely-used technique in model compression, where the compressed model (student) is trained under the guidance of the original model (teacher). This is achieved by minimizing the difference between the features produced by the teacher and the student:

$\mathcal{L}_{KD} = \sum_{x \in \mathcal{X}} L\left(f^S(x), f^T(x)\right) \quad (2)$

where $f^S$ and $f^T$ are a pair of features from the student and teacher respectively, $L$ is the loss function and $x$ is a data sample. In terms of BERT compression, the predicted probability over classes, the intermediate representations and the self-attention distributions can all be used as the features to transfer. In this paper, we focus on the intermediate representations $\{H_l\}_{l=0}^{L}$ (i.e., the HSK), which have been shown to be a useful source of knowledge in BERT compression. The loss function is computed as the Mean Squared Error (MSE) in a layer-wise way:

$\mathcal{L}_{HSK} = \sum_{l=0}^{L'} \mathrm{MSE}\left(H^S_l W, H^T_{g(l)}\right) \quad (3)$

where $L'$ is the student's layer number, $g(l)$ is the layer mapping function that selects teacher layers, and $W \in \mathbb{R}^{d^S_H \times d^T_H}$ is the linear transformation that projects the student's representations $H^S_l$ to the same size as the teacher's representations $H^T_{g(l)}$.
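The layer-wise MSE loss above can be sketched as follows. This is a minimal illustration of Equation 3, assuming the student and teacher hidden states have already been paired by the layer mapping g(l); the tensor sizes (312 for the student, 768 for the teacher) are hypothetical.

```python
import torch
import torch.nn.functional as F

def hsk_loss(student_states, teacher_states, proj):
    """Layer-wise MSE between projected student hidden states and the
    selected teacher hidden states. Both arguments are lists of
    (batch, |x|, d) tensors, already paired by the layer mapping g(l);
    `proj` is the linear map W from d_H^S to d_H^T."""
    loss = 0.0
    for h_s, h_t in zip(student_states, teacher_states):
        loss = loss + F.mse_loss(proj(h_s), h_t)
    return loss

# hypothetical sizes: student width 312, teacher width 768, L'+1 = 5 states
proj = torch.nn.Linear(312, 768, bias=False)
student = [torch.randn(2, 16, 312) for _ in range(5)]
teacher = [torch.randn(2, 16, 768) for _ in range(5)]
print(float(hsk_loss(student, teacher, proj)))
```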

HSK Compression
According to Equation 3, the HSK from the teacher can be stacked into a tensor $\mathbf{H}^T = [H^T_{g(0)}, \cdots, H^T_{g(L')}] \in \mathbb{R}^{(L'+1) \times |x| \times d^T_H}$, which consists of three structural dimensions, namely depth, length and width. For the depth dimension, $\mathbf{H}^T$ can be compressed by eliminating entire layers. By dropping the representations corresponding to particular tokens, we compress the length dimension. When it comes to the width dimension, we set the eliminated activations to zero. We will discuss the strategies to compress each dimension in Section 4.
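The three compression operations can be sketched on the stacked HSK tensor. The index choices below are placeholders; Section 4 compares the concrete strategies for picking them.

```python
import torch

def compress_hsk(H, layer_idx, token_idx, width_mask):
    """Sketch of three-dimensional HSK compression on the stacked
    teacher tensor H of shape (L'+1, |x|, d_H^T):
      - depth:  keep only the layers in `layer_idx`
      - length: keep only the token positions in `token_idx`
      - width:  zero the activations where `width_mask` is 0"""
    H = H[layer_idx]          # drop entire layers
    H = H[:, token_idx]       # drop particular tokens' representations
    return H * width_mask     # set eliminated activations to zero

H = torch.randn(5, 16, 768)               # (L'+1, |x|, d_H^T)
keep_layers = torch.tensor([3, 4])        # N_D = 2
keep_tokens = torch.tensor([0, 1, 2, 3])  # N_L = 4
mask = (torch.rand(768) < 0.2).float()    # roughly 20% of the width kept
out = compress_hsk(H, keep_layers, keep_tokens, mask)
print(out.shape)
```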

Datasets
We perform experiments on seven tasks from the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019): CoLA (linguistic acceptability), SST-2 (sentiment analysis), RTE, QNLI, MNLI-m and MNLI-mm (natural language inference), MRPC and STS-B (semantic matching/similarity). Due to space limitations, we only report results on CoLA, SST-2, QNLI and MNLI for single-dimension HSK compression in Section 4; results on the other three tasks are presented in Appendix E.

Evaluation
Following Devlin et al. (2019), on the dev sets we use Matthews correlation and Spearman correlation to evaluate the performance on CoLA and STS-B respectively. For the other tasks, we report the classification accuracy. We use the dev sets to conduct our exploratory studies, and the test set results are reported to compare HSK compression with the existing distillation strategy. For the test set of MRPC, we report the F1 score.

Implementation Details
We take two representative KD-based methods, i.e., TinyBERT (Jiao et al., 2020) and ROSITA, as examples to conduct our analysis. TinyBERT is a compact version of BERT that is randomly initialized. It is trained with two-stage KD: first on unlabeled general-domain data and then on the task-specific training data. ROSITA replaces the first-stage KD with structured pruning and matrix factorization, which can be seen as a direct transfer of BERT's knowledge from the model parameters.
We focus on KD with the task-specific training data and do not use any data augmentation. For TinyBERT, the student model is initialized with the 4-layer general distillation model provided by Jiao et al. (2020) (denoted as TinyBERT_4). For ROSITA, we first fine-tune BERT_BASE on the downstream task and then compress it following the ROSITA procedure to obtain a 6-layer student model (denoted as ROSITA_6). The fine-tuned BERT_BASE is used as the shared teacher for TinyBERT and ROSITA. Following Jiao et al. (2020), we first conduct HSK distillation as in Equation 3 (without distilling the self-attention distribution) and then distill the teacher's predictions using cross-entropy loss. All the results are averaged over three runs with different random seeds. The model architecture of the students and the hyperparameter settings can be seen in Appendix A and Appendix B respectively.

Single-Dimension Knowledge Compression
Research on model pruning has shown that the structural units in a model are of different levels of importance, and the unimportant ones can be dropped without affecting the performance. In this section, we investigate whether the same law holds for HSK compression in KD. We study the three dimensions separately and compare a variety of strategies to extract the crucial knowledge. When a certain dimension is compressed, the other two dimensions are kept to full scale.

Compression Strategies
From the layer point of view, HSK compression can be divided into two steps. First, the layer mapping function $g(l)$ selects one of the teacher layers for each student layer. This produces $L'+1$ pairs of teacher-student features: $\{(H^S_0, H^T_{g(0)}), \cdots, (H^S_{L'}, H^T_{g(L')})\}$. Second, a subset of these feature pairs is selected to perform HSK distillation.
For the first step, a simple but effective strategy is the uniform mapping function:

$g(l) = l \times \frac{L}{L'} \quad (4)$

In this way, the teacher layers are divided into $L'$ blocks and the top layer of each block serves as the guidance in KD. Recently, Wang et al. (2020a) empirically show that the upper-middle layers of BERT, as compared with the top layer, are a better choice to guide the top layer of the student in self-attention distillation. Inspired by this, we redesign Equation 4 to allow the top student layer to distill knowledge from an upper-middle teacher layer, while the lower layers follow the uniform mapping principle:

$g(l, L_{top}) = \mathrm{round}\left(l \times \frac{L_{top}}{L'}\right) \quad (5)$

where $L_{top}$ is the teacher layer corresponding to the top student layer and $\mathrm{round}(\cdot)$ is the rounding-off operation. Figure 2 gives an illustration of $g(l, L_{top})$ with a 6-layer teacher and a 3-layer student. Specifically, for the 12-layer BERT_BASE teacher, we select $L_{top}$ from {8, 10, 12}. For the second step, we simply keep the top $N_D$ feature pairs.

Results and Analysis

Figure 3 presents the results of depth compression with different layer mapping functions. We can find that: 1) For the $g(l, 12)$ mapping function (the grey lines), depth compression generally has a negative impact on the students' performance. In particular, the performance of ROSITA_6 declines drastically when the number of layers is reduced to 1∼3. 2) In terms of the $g(l, 10)$ and $g(l, 8)$ mapping functions (the blue and orange lines), HSK distillation with only one or two layers can achieve comparable performance to using all $L'+1$ layers. On the QNLI and MNLI datasets, the performance can even be improved by eliminating the lower layers. 3) In general, the student achieves better results with the redesigned layer mapping function in Equation 5 across the four tasks. This demonstrates that, like the self-attention knowledge, the most crucial HSK does not necessarily reside in the top BERT layer, which reveals a potential way to improve HSK distillation of BERT.
4) Compared with g(l, 8), the improvement brought by g(l, 10) is more stable across different tasks and student models. Therefore, we use the g(l, 10) layer mapping function when investigating the other two dimensions.
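One plausible reading of the redesigned mapping function (Equation 5) can be written down in a few lines. The exact rounding convention is an assumption (half-up rounding here); note that with L_top = 12 it recovers the uniform mapping for a 12-layer teacher.

```python
def g(l, l_top, n_student_layers):
    """Redesigned layer mapping: the top student layer maps to teacher
    layer L_top, and the lower layers follow the uniform principle
    rescaled to L_top. Half-up rounding is an assumption."""
    return int(l * l_top / n_student_layers + 0.5)

# 12-layer teacher, 4-layer student, as in the paper's setting
print([g(l, 12, 4) for l in range(5)])  # recovers the uniform mapping
print([g(l, 10, 4) for l in range(5)])  # top student layer guided by teacher layer 10
```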

Compression Strategies
To compress the length dimension, we design a method to measure the tokens' importance by using the teacher's self-attention distribution. The intuition is that self-attention controls the information flow among tokens across layers, and thus the representations of the most attended tokens may contain crucial information.
Assuming that the teacher has $A_h$ attention heads, the attention weights in the $l$-th layer can be written as $A^T_l = [A^T_{l,1}, \cdots, A^T_{l,A_h}]$, where $A^T_{l,a} \in \mathbb{R}^{|x| \times |x|}$ is the attention matrix of the $a$-th head. Each row of $A^T_{l,a}$ is the attention distribution of a particular token over all the tokens. In our length compression strategy, the importance score of the tokens is the attention distribution of the [CLS] token (i.e., the first row of $A^T_{l,a}$) averaged over the $A_h$ heads:

$S_l = \frac{1}{A_h} \sum_{a=1}^{A_h} A^T_{l,a,[\mathrm{CLS}]} \quad (6)$

To match the depth of the student, we employ the layer mapping function in Equation 5 to select $S_{g(l, L_{top})}$ for the $l$-th student layer. The length compression strategies examined in this section are summarized as follows:

Att is the attention-based strategy as described above. The layer mapping function to select S is the same as the one to select HSK, i.e., g(l, 10).
Att w/o [SEP] excludes the HSK of the special token [SEP]. The rationality of this operation will be explained in the following analysis.
Left is a naive baseline that discards tokens from the tail of the text sequence. When the token number is reduced to 1, the student only distills the HSK from the [CLS] token.
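The attention-based importance score (Equation 6) and the resulting token selection can be sketched as follows. The toy attention tensor and the [SEP] position are made up for illustration; excluding [SEP] corresponds to the "Att w/o [SEP]" variant.

```python
import torch

def cls_attention_scores(attn, sep_idx=None):
    """Token importance: the [CLS] token's attention distribution
    (row 0 of each head's attention matrix), averaged over the A_h
    heads. If `sep_idx` is given, [SEP] is excluded from selection."""
    scores = attn[:, 0, :].mean(dim=0)       # (|x|,)
    if sep_idx is not None:
        scores = scores.clone()
        scores[sep_idx] = float('-inf')
    return scores

def select_tokens(attn, n_keep, sep_idx=None):
    # indices of the N_L most [CLS]-attended tokens
    return torch.topk(cls_attention_scores(attn, sep_idx), n_keep).indices

# toy attention: A_h = 12 heads over a 16-token sequence, [SEP] at index 15
attn = torch.softmax(torch.randn(12, 16, 16), dim=-1)
kept = select_tokens(attn, n_keep=4, sep_idx=15)
print(sorted(kept.tolist()))
```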

Results and Analysis
The length compression results are shown in Figure 4 and Figure 5. We can derive the following observations: 1) For all strategies, significant performance decline can only be observed when HSK length is compressed heavily (to less than 0.05 ∼ 0.30). In some cases, using a subset of tokens' representation even leads to perceivable  improvement over the full length (e.g., ROSITA 6 on CoLA and TinyBERT 4 on SST-2 and QNLI).
2) The performance of Att is not satisfactory. When being applied to ROSITA 6 , the Att strategy underperforms the Left baseline. The results of Att in TinyBERT 4 , though better than those in ROSITA 6 , still lag behind the other strategies at the left-most points. 3) Excluding [SEP] in the Att strategy alleviates the drop in performance, especially when HSK length is compressed to less than 0.05. 4) As a general trend, further improvement over Att w/o [SEP] can be obtained by using g(l, 12) in the selection of S, which produces the most robust results among the four strategies.
To explain why the Att strategy performs poorly, we inspect the tokens that receive the highest importance scores under Equation 6. We find that the special token [SEP] is dominant in most hidden layers. As shown in Figure 6, [SEP] is frequently the most attended token from the 4th ∼ 10th layers. Combined with the results in Figure 4 and Figure 5, it can be inferred that the representations of [SEP] are not a desirable source of knowledge for ROSITA and TinyBERT. We conjecture that this is because there exist some trivial patterns in the representations of [SEP], which prevent the student from extracting the informative features that are more relevant to the task.

Compression Strategies
As discussed in Section 2.3, the width dimension is compressed by setting some activations in the intermediate representations to zero. Practically, we apply a binary mask $M \in \mathbb{R}^{d^T_H}$ to the vectors in $H^T_l$, which gives rise to $[M \odot h^T_{l,1}, \cdots, M \odot h^T_{l,|x|}]$, where $\odot$ denotes the element-wise product. On this basis, we introduce and compare three masking designs for width compression: Rand Mask randomly sets the values in M to zero, where the total number of "0"s is controlled by the compression ratio. This mask is static, i.e., all $h^T_{l,i}$ ($\forall i, l$) and all the training samples share the same mask.
Uniform Mask is also a static mask. It is constructed by distributing "0" in a uniform way. Formally, $M_j = 1$ if $j \in \mathcal{I}$ and $M_j = 0$ otherwise, where $\mathcal{I}$ is the set of indices of the $N_W$ remained activations, spaced evenly along the width dimension.
Mag Mask masks out the activations with low magnitude. Therefore, this mask is dynamic, i.e., every h T l,i (∀i, l) has its own M.
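The three masking designs can be sketched as follows. This is a minimal illustration; the width and keep-count values are hypothetical.

```python
import torch

d_h, n_w = 768, 154                      # teacher width d_H^T, kept activations N_W

# Rand Mask: static, random positions kept, shared by every h_{l,i}
rand_mask = torch.zeros(d_h)
rand_mask[torch.randperm(d_h)[:n_w]] = 1.0

# Uniform Mask: static, kept positions spread evenly along the width
uniform_mask = torch.zeros(d_h)
uniform_mask[torch.linspace(0, d_h - 1, n_w).long()] = 1.0

# Mag Mask: dynamic, each hidden vector keeps its own top-N_W magnitudes
def mag_mask(h, n_w):
    idx = torch.topk(h.abs(), n_w, dim=-1).indices
    return torch.zeros_like(h).scatter_(-1, idx, 1.0)

h = torch.randn(16, d_h)                 # hidden states of a 16-token sequence
masked = mag_mask(h, n_w) * h            # width-compressed HSK
print(int(rand_mask.sum()), int(uniform_mask.sum()))
```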

Results and Analysis
The width compression results can be seen in Figure 7, from which we can obtain two findings. First, the masks reveal different patterns when combined with different student models. For ROSITA 6 , the performance of Rand Mask and Uniform Mask decreases sharply at 20% HSK width. In comparison, the performance change is not that significant when it comes to TinyBERT 4 . This suggests that TinyBERT 4 is more robust to HSK width compression than ROSITA 6 . Second, the magnitude-based masking strategy obviously outperforms Rand Mask and Uniform Mask. As we compress the nonzero activations in HSK from 100% to 20%, the performance drop of Mag Mask is only marginal, indicating that there exists considerable knowledge redundancy in the width dimension.

Three-Dimension Joint Knowledge Compression
With the findings in single-dimension compression, we are now in a position to investigate joint HSK compression from the three dimensions.

Measuring the Amount of HSK
For every single dimension, measuring the amount of HSK is straightforward: using the number of layers, tokens and activations for depth, length and width respectively. In order to quantify the total amount of HSK (denoted as $A_{HSK}$), we define one unit of $A_{HSK}$ as the amount of HSK in any $h^T_{l,i}$ ($\forall l \in [0, L'], i \in [1, |x|]$). In other words, the $A_{HSK}$ of $\mathbf{H}^T$ equals $(L'+1) \times |x|$. When HSK is compressed to $N_D$ layers, $N_L$ tokens and $N_W$ activations, the $A_{HSK}$ is $N_D \times N_L \times N_W / d^T_H$. There could be numerous configurations for three-dimension (3D) HSK compression, and we could have multiple combinations of $(N_D, N_L, N_W)$ that satisfy a particular $A_{HSK}$. In practice, we reconstruct the search space as a set of such combinations.
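The unit accounting above can be checked with a small computation. The normalization of the width term by d_H^T follows from the unit definition; the configuration values below (one layer, nine tokens, 10% of the width) are hypothetical, and the sequence length 128 is taken from the efficiency-evaluation settings in Appendix D.

```python
def a_hsk(n_d, n_l, n_w, d_h_teacher):
    """A_HSK in units of one hidden vector h_{l,i}: a configuration
    (N_D, N_L, N_W) contributes N_D * N_L * (N_W / d_H^T) units.
    The width term is fractional because width compression zeroes
    activations inside each vector rather than dropping whole vectors."""
    return n_d * n_l * n_w / d_h_teacher

# full-scale HSK for a 6-layer student (L'+1 = 7) and |x| = 128 tokens
full_units = 7 * 128

# hypothetical compressed configuration: N_D = 1, N_L = 9, 10% of the width
units = a_hsk(n_d=1, n_l=9, n_w=int(0.1 * 768), d_h_teacher=768)
print(units, 100 * units / full_units)   # under 0.1% of the full budget
```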

Compression Configurations & Strategies
To study the student's performance with different amounts of HSK, we sample a set of configurations for a range of $A_{HSK}$, the statistics of which are summarized in Table 1. Details of the configurations can be seen in Appendix C.
To compress each single dimension in joint HSK compression, we utilize the most advantageous strategies that we found in Section 4. Specifically, Att (L top = 12) w/o [SEP] is used to compress length, Mag Mask is used to compress width and the g(l, L top ) for depth compression is selected according to the performance of depth compression.

Results and Analysis
The results of 3D joint HSK compression are presented in Figure 8 and Figure 9. As we can see, introducing HSK in KD brings consistent improvement over the conventional prediction distillation method. However, the marginal benefit quickly diminishes as more HSK is included. Notably, with less than 1% of HSK, the student models can achieve the same or better results as full-scale HSK distillation. Over a certain threshold of $A_{HSK}$, the performance begins to decrease. Among different tasks and student models, the gap between the best results (peaks on the blue lines) and full-scale HSK distillation varies from 0.3 (ROSITA_6 on MNLI and STS-B) to 5.3 (TinyBERT_4 on CoLA). The results also suggest that the existing BERT distillation method (i.e., g(l, 12)) can be improved by simply compressing HSK: numerous points of different configurations lie above the red stars. Table 2 presents the results of different KD-based BERT compression methods. For fair comparison, we do not include other methods described in Section 7, because they either distill a different type of knowledge or use a different student model structure. Here, we focus on comparing the performance with or without HSK compression given the same student model. We can see that, except for a few tasks on the test sets, HSK compression consistently promotes the performance of the baseline methods.

Improving Training Efficiency
Existing BERT compression methods mostly focus on improving the inference efficiency. However, the teacher model is used to extract features throughout the training process, which suggests that the training efficiency still has room for improvement. As shown in Figure 10, the compressed models achieve considerable inference speedup, while the increase in training speed is relatively small. Moreover, for students with different sizes or architectures, the teacher has to be deployed every time a new student is trained. Intuitively, we can run the teacher once and reuse the features for all the students. In this way, we do not need to load the teacher model while training the student, thereby increasing the training speed. We refer to this strategy as offline HSK distillation.
To evaluate the training efficiency of the proposed KD paradigm, we compute the training time on the MNLI dataset. The results are presented in the left plots of Figure 10. As we can see, offline HSK distillation increases the training speed of the student models, as compared with online distillation. The speedup is consistent for different student models and devices.
Despite the training speedup, however, loading and storing HSK increases the memory consumption. The full set of HSK can take up a large amount of space, especially for pre-trained language models like BERT. Fortunately, our findings in the previous sections suggest that the student only requires a tiny fraction of HSK. Table 3 summarizes the actual memory consumption of four configurations with different $A_{HSK}$. As we can see, the full set of HSK for ROSITA_6 takes up approximately 1 TB of memory space, which is only applicable to some high-end cloud servers. Compressing the HSK can reduce the size to the GB level, which enables training on devices like personal computers. It is worth noting that storing the dynamic Mag Mask is memory-consuming, typically accounting for more space than the HSK itself. However, the binary masks can be further compressed using data compression algorithms. Based on the above results and analysis, we summarize our paradigm for efficient HSK distillation as follows: First, the teacher BERT runs on the training data to obtain and store the features of HSK and predictions. This can be done on devices that have sufficient computing and memory resources. Then, according to the target application and device, we decide the student's structure and the amount of HSK to distill. Finally, KD can be performed on a cloud server or directly on the target device.
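The offline step of this paradigm, i.e., running the teacher once and storing the compressed HSK and predictions, can be sketched as below. The function and argument names are illustrative, and the toy teacher is a stand-in for a fine-tuned BERT; in a real pipeline `teacher_fn` would wrap a model forward pass that also compresses the width dimension.

```python
import torch, tempfile, os

def dump_teacher_hsk(teacher_fn, dataset, keep_layers, keep_tokens, path):
    """Run the teacher once over the training set and store the
    compressed HSK plus the teacher's predictions, so the teacher never
    has to be loaded while training students. `teacher_fn` is any
    callable returning (stacked_hidden_states, logits)."""
    records = []
    with torch.no_grad():
        for x in dataset:
            hidden_states, logits = teacher_fn(x)
            hsk = hidden_states[keep_layers][:, keep_tokens]  # depth + length
            records.append({'hsk': hsk, 'logits': logits})
    torch.save(records, path)
    return records

# toy stand-in teacher: 13 "layers" (embedding + 12) of 16-token, 768-dim states
def toy_teacher(x):
    return torch.randn(13, 16, 768), torch.randn(2)

path = os.path.join(tempfile.gettempdir(), 'hsk.pt')
records = dump_teacher_hsk(toy_teacher, range(4),
                           keep_layers=torch.tensor([10]),
                           keep_tokens=torch.arange(9), path=path)
print(len(records), tuple(records[0]['hsk'].shape))
```

During student training, the stored records replace the teacher's forward pass in the layer-wise loss of Equation 3.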

Related Work
KD is widely studied in BERT compression. In addition to distilling the teacher's predictions as in Hinton et al. (2015), research has shown that the student's performance can be improved by using the representations from intermediate BERT layers (Sun et al., 2019; Hou et al., 2020) and the self-attention distributions (Jiao et al., 2020; Sun et al., 2020). Typically, the knowledge is extensively distilled in a layer-wise manner. To fully utilize BERT's knowledge, some recent work also proposes to combine multiple teacher layers in BERT KD (Passban et al., 2021) or in KD on Transformer-based NMT models (Wu et al., 2020). In contrast to these studies that attempt to increase the amount of knowledge, we study BERT KD from the compression point of view. A similar idea can be found in MiniLM (Wang et al., 2020a,b), which only uses the teacher's knowledge to guide the last layer of the student. However, they only consider knowledge from the layer dimension, while we investigate the three dimensions of HSK.
We explore a variety of strategies to determine feature importance for each single dimension. This is related to a line of studies called attribution methods, which attempt to attribute a neural network's prediction to the input features. Attention weights have also been investigated as an attribution method. However, prior work (Wiegreffe and Pinter, 2019; Serrano and Smith, 2019; Brunner et al., 2020; Hao et al., 2020) finds that attention weights usually fail to correlate well with their contributions to the final prediction. This echoes our finding that the original Att strategy performs poorly in length compression. However, attention weights may play different roles in attribution and HSK distillation. Whether the findings in attribution are transferable to HSK distillation is still a problem that needs further investigation.

Conclusions and Future Work
In this paper, we investigate the compression of HSK in BERT KD. We divide the HSK of BERT into three dimensions and explore a range of compression strategies for each single dimension. On this basis, we jointly compress the three dimensions and find that, with a tiny fraction of HSK, the student can achieve the same or even better performance as distilling the full-scale knowledge. Based on this finding, we propose a new paradigm to improve the training efficiency in BERT KD, which does not require loading the teacher model during training. The experiments show that the training speed can be increased by 2.7× ∼ 3.4× for two kinds of student models and two types of CPU and GPU devices.
Most of the compression strategies investigated in this study are heuristic and still have room for improvement. Therefore, a future direction of our work could be designing more advanced algorithms to search for the most useful HSK in BERT KD. Additionally, since HSK distillation in the pre-training stage is orders of magnitude more time-consuming than task-specific distillation, the marginal utility diminishing effect in pre-training distillation is also a problem worth studying.

A Architecture of Student Models
TinyBERT (Jiao et al., 2020) rescales the structure of BERT in terms of the number of layers, the dimension of the Transformer layer outputs, and the hidden dimension of the feed-forward networks. We use the 4-layer version (14.5M parameters) of TinyBERT released by Jiao et al. (2020).
ROSITA compresses BERT along four structural dimensions, namely the number of layers, the number of attention heads, the hidden dimension of the feed-forward network, and the rank of the SVD used to compress the embedding matrix. In practice, we scale the four dimensions to construct a 6-layer model ROSITA_6 that has approximately the same size as TinyBERT_4. ROSITA_6 has 6 layers and 2 attention heads, and its FFN dimension and embedding matrix rank are 768 and 128 respectively.

B Hyperparameters
Following Jiao et al. (2020), we first distill HSK and then distill the teacher's predictions. The hyperparameters for HSK distillation basically follow Jiao et al. (2020), except that the training epochs for CoLA are changed from 50 to 30, the training epochs for QNLI are changed from 10 to 5, and the batch size for MNLI and QNLI is changed from 256 to 64. For prediction distillation, we use the linear decaying learning rate schedule. For each model and dataset, we tune the number of epochs (from {5, 10}) and the learning rate (from {2e-5, 5e-5}) for the baseline method that uses the uniform layer-wise strategy g(l, 12), and these hyperparameters are then used for all the results with compressed HSK. Table 4 summarizes the hyperparameters.

C Configurations of 3D Compression Strategy
As described in the paper, for each $A_{HSK}$ we can obtain a number of configurations. Specifically, when we use ROSITA_6, there are 13, 21, 45, 75 and 112 configurations for $A_{HSK} = 1 \pm 10\%$, $3 \pm 10\%$, $5 \pm 10\%$, $10 \pm 10\%$ and $50 \pm 10\%$ respectively. We randomly sample subsets of configurations in our experiments, the statistics of which are shown in Table 5.

D Experimental Settings for Efficiency Evaluation
In Figure 10, we show the training and inference time of two models on two devices. The training time is computed as the time to run 500 training steps (i.e., batches of data). When it comes to inference, we run the models on the entire training set and dev set for GPU and CPU respectively. For training, the batch size is set to 64 and 16 for GPU and CPU respectively. For inference, we set the batch size to 128 and 1 for GPU and CPU respectively. The maximum sequence length is 128 for all settings. For offline distillation, we use the configuration (1, 9, 0.1).

E.1 Full Results of Depth Compression
Depth compression results on all seven tasks are presented in Figure 11. Like the results on CoLA, SST-2, QNLI and MNLI, the results on MRPC, RTE and STS-B also suggest that the redesigned mapping functions (i.e., g(l, 8) and g(l, 10)) generally outperform the original uniform mapping function g(l, 12), especially when HSK is compressed to one layer.

E.2 Full Results of Length Compression
Length compression results on all seven tasks are presented in Figure 12 and Figure 13. As we can see, the general trends on MRPC, RTE and STS-B are in accordance with the other four tasks, whose results are discussed in Section 4.

E.3 Attention on [SEP]

Figure 14 shows the proportion of data samples where [SEP] is the top1 and top3 most attended token. We can see that for most data samples, [SEP] is among the top3 tokens and frequently appears as the top1 from the 4th ∼ 10th layers. This pattern is consistent across the seven tasks.

E.4 Full Results of Width Compression

Figure 15 shows the full results of width compression on all seven tasks. We can see that the gap between compression strategies is larger for ROSITA_6, as compared with TinyBERT_4. Among the three strategies, Mag Mask clearly outperforms Rand Mask and Uniform Mask.