MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers

We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by using only self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as the scaled dot-product between pairs of query, key, and value vectors within each self-attention module. We then employ this relational knowledge to train the student model. Besides its simplicity and unified principle, our method, more favorably, imposes no restriction on the number of the student's attention heads, whereas most previous work has to guarantee the same number of heads between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by the Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa and XLM-R) outperform the state-of-the-art.


Introduction
Pretrained Transformers (Radford et al., 2018; Devlin et al., 2018; Radford et al., 2019; Raffel et al., 2019) have been highly successful for a wide range of natural language processing tasks. However, these models usually consist of hundreds of millions of parameters and are getting bigger. This brings challenges for fine-tuning and online serving in real-life applications due to restrictions on computation resources and latency.
Knowledge distillation (KD; Hinton et al. 2015, Romero et al. 2015) has been widely employed to compress pretrained Transformers; it transfers the knowledge of a large model (teacher) to a small model (student) by minimizing the differences between teacher and student features. Soft target probabilities (soft labels) and intermediate representations are usually utilized to perform KD training. In this work, we focus on task-agnostic compression of pretrained Transformers (Sanh et al., 2019; Tsai et al., 2019; Jiao et al., 2019; Sun et al., 2019b). The student models are distilled from large pretrained Transformers using large-scale text corpora. The distilled task-agnostic model can be directly fine-tuned on downstream tasks, and can be utilized to initialize task-specific distillation. DistilBERT (Sanh et al., 2019) uses soft target probabilities for masked language modeling predictions and embedding outputs to train the student. The student model is initialized from the teacher by taking one layer out of every two. TinyBERT (Jiao et al., 2019) utilizes hidden states and self-attention distributions (i.e., attention maps and weights), and adopts a uniform function to map student and teacher layers for layer-wise distillation. MobileBERT (Sun et al., 2019b) introduces specially designed teacher and student models using inverted-bottleneck and bottleneck structures to keep their layer number and hidden size the same, transferring hidden states and self-attention distributions layer by layer. MINILM (Wang et al., 2020) proposes deep self-attention distillation, which uses self-attention distributions and value relations to help the student deeply mimic the teacher's self-attention modules. MINILM shows that transferring the knowledge of the teacher's last layer achieves better performance than layer-wise distillation.
In summary, most previous work relies on self-attention distributions to perform KD training, which imposes the restriction that the number of attention heads of the student model must be the same as its teacher's.
In this work, we generalize and simplify the deep self-attention distillation of MINILM (Wang et al., 2020) by using self-attention relation distillation. We introduce multi-head self-attention relations, computed by scaled dot-product of pairs of queries, keys and values, which guide the student training. Taking query vectors as an example, in order to obtain queries of multiple relation heads, we first concatenate the query vectors of different attention heads, and then split the concatenated vector according to the desired number of relation heads. Afterwards, for teacher and student models with different head numbers, we can align their queries with the same number of relation heads for distillation. Moreover, using a larger number of relation heads brings more fine-grained self-attention knowledge, which helps the student achieve a deeper mimicry of the teacher's self-attention module. In addition, for large-size (24 layers, 1024 hidden size) teachers, extensive experiments indicate that transferring an upper middle layer tends to perform better than using the last layer as in MINILM.
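The concatenate-then-split conversion described above can be sketched in NumPy. This is a minimal illustration, not the authors' code; `to_relation_heads` is a hypothetical helper name:

```python
import numpy as np

def to_relation_heads(per_head_vectors, num_relation_heads):
    """Concatenate per-attention-head vectors, then re-split into relation heads.

    per_head_vectors: array of shape (num_attention_heads, seq_len, head_size),
    e.g. the query vectors of all attention heads of one layer.
    Returns an array of shape (num_relation_heads, seq_len, relation_head_size).
    """
    num_heads, seq_len, head_size = per_head_vectors.shape
    hidden = num_heads * head_size
    assert hidden % num_relation_heads == 0, "hidden size must divide evenly"
    # (heads, seq, d_k) -> (seq, heads * d_k): concatenate across attention heads
    concat = per_head_vectors.transpose(1, 0, 2).reshape(seq_len, hidden)
    d_r = hidden // num_relation_heads
    # (seq, hidden) -> (relation_heads, seq, d_r): split into relation heads
    return concat.reshape(seq_len, num_relation_heads, d_r).transpose(1, 0, 2)

# A teacher with 12 attention heads (768 hidden) and a student with, say,
# 6 heads of the same hidden size can both be mapped to 48 relation heads,
# so their relations become directly comparable.
teacher_q = np.zeros((12, 128, 64))   # toy shapes, not the paper's models
aligned_q = to_relation_heads(teacher_q, 48)
```

Because concatenation and re-splitting are pure reshapes, no parameters are added; only the grouping of the hidden dimension changes.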
Experimental results show that our student models distilled from BERT and RoBERTa both outperform state-of-the-art models at different parameter sizes. The 6×768 (6 layers, 768 hidden size) model distilled from BERT LARGE is 2.0× faster while achieving better performance than BERT BASE . The base-size model distilled from RoBERTa LARGE outperforms RoBERTa BASE while using far fewer training examples.

Backbone Network: Transformer
Multi-layer Transformer (Vaswani et al., 2017) has been widely adopted in pretrained models. Each Transformer layer consists of a self-attention sub-layer and a position-wise fully connected feed-forward sub-layer.
Self-Attention Transformer relies on multi-head self-attention to capture dependencies between words. Given the previous Transformer layer's output H l−1 ∈ R |x|×d h , the output of a self-attention head O l,a , a ∈ [1, A h ], is computed via:

$$\mathbf{Q}_{l,a} = \mathbf{H}_{l-1}\mathbf{W}^{Q}_{l,a},\quad \mathbf{K}_{l,a} = \mathbf{H}_{l-1}\mathbf{W}^{K}_{l,a},\quad \mathbf{V}_{l,a} = \mathbf{H}_{l-1}\mathbf{W}^{V}_{l,a}$$

$$\mathbf{A}_{l,a} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_{l,a}\mathbf{K}_{l,a}^{\top}}{\sqrt{d_k}}\right)$$

$$\mathbf{O}_{l,a} = \mathbf{A}_{l,a}\mathbf{V}_{l,a}$$

The previous layer's output H l−1 is linearly projected to queries, keys and values using parameter matrices W Q l,a , W K l,a , W V l,a ∈ R d h ×d k , respectively. The self-attention distributions are computed via scaled dot-product of queries and keys. These weights are assigned to the corresponding value vectors to obtain the attention output. |x| denotes the length of the input sequence, A h and d h denote the number of attention heads and the hidden size, and d k is the attention head size; d k × A h is usually equal to d h .
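The per-head computation can be sketched in NumPy as a minimal illustration of the formulas above; the matrices and sizes below are toy values, not the paper's:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(H, W_q, W_k, W_v):
    """One self-attention head.

    H: (|x|, d_h) previous layer's output; W_q, W_k, W_v: (d_h, d_k).
    Returns the head output O of shape (|x|, d_k).
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = W_q.shape[1]
    # (|x|, |x|) self-attention distribution: each row sums to 1
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V
```

The full multi-head sub-layer would run `A_h` such heads with separate projections and concatenate their outputs along the feature dimension.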

Pretrained Language Models
Pre-training has led to strong improvements across a variety of natural language processing tasks. Pretrained language models are learned on large amounts of text data, and then fine-tuned to adapt to specific tasks. BERT (Devlin et al., 2018) proposes to pretrain a deep bidirectional Transformer using the masked language modeling (MLM) objective. UNILM is jointly pretrained on three types of language modeling objectives to adapt to both understanding and generation tasks. XLNet introduces a permutation language modeling objective to predict masked tokens auto-regressively. SpanBERT improves BERT by incorporating span information. RoBERTa achieves strong performance by training for longer steps with a large batch size and more text data. MASS (Song et al., 2019), T5 (Raffel et al., 2019) and BART employ a standard encoder-decoder structure and pretrain the decoder auto-regressively. A pseudo-masked language model has also been proposed, jointly pretrained on MLM and partially auto-regressive MLM objectives. Besides monolingual pretrained models, multilingual pretrained models (Devlin et al., 2018; Lample and Conneau, 2019; Chi et al., 2019; Chi et al., 2020) also advance the state-of-the-art on cross-lingual understanding and generation benchmarks.

Figure 1: Overview of multi-head self-attention relation distillation. We introduce multi-head self-attention relations, computed by scaled dot-product of pairs of queries, keys and values, to guide the training of students. In order to obtain self-attention vectors (queries, keys and values) of multiple relation heads, we first concatenate the self-attention vectors of different attention heads and then split them according to the desired number of relation heads. For a large-size teacher, we transfer the self-attention knowledge of an upper middle layer of the teacher. For a base-size teacher, using the last layer achieves better performance. Our student models are named MINILMv2.

Knowledge Distillation
Knowledge distillation has been proven to be a promising way to compress large models while maintaining accuracy. The knowledge of a single large model or an ensemble of large models is used to guide the training of small models. Hinton et al. (2015) propose to use soft target probabilities to train student models. More fine-grained knowledge such as hidden states (Romero et al., 2015) and attention distributions (Zagoruyko and Komodakis, 2017; Hu et al., 2018) has been introduced to improve the student model. In this work, we focus on task-agnostic knowledge distillation of pretrained Transformers. The distilled task-agnostic model can be fine-tuned to adapt to downstream tasks. It can also be utilized to initialize task-specific distillation (Sun et al., 2019a; Turc et al., 2019; Aguilar et al., 2019; Mukherjee and Awadallah, 2020; Hou et al., 2020), which uses a fine-tuned teacher model to guide the training of the student on specific tasks. The knowledge used for distillation and the layer mapping function are two key points for task-agnostic distillation of pretrained Transformers. Most previous work uses soft target probabilities, hidden states, self-attention distributions and value relations to train the student model. For the layer mapping function, TinyBERT (Jiao et al., 2019) uses a uniform strategy to map teacher and student layers. MobileBERT (Sun et al., 2019b) assumes the student has the same number of layers as its teacher to perform layer-wise distillation. MINILM (Wang et al., 2020) transfers the self-attention knowledge of the teacher's last layer to the student's last Transformer layer. Different from previous work, our method uses multi-head self-attention relations to eliminate the restriction on the number of the student's attention heads. Moreover, we show that transferring the self-attention knowledge of an upper middle layer of a large-size teacher model is more effective.

Multi-Head Self-Attention Relation Distillation
Following MINILM, the key idea of our approach is to deeply mimic the teacher's self-attention module, which draws dependencies between words and is the vital component of the Transformer. MINILM uses the teacher's self-attention distributions to train the student model, which restricts the number of attention heads of the student to be the same as its teacher's. To achieve a deeper mimicry and avoid using the teacher's self-attention distributions, we introduce multi-head self-attention relations of pairs of queries, keys and values to train the student. Besides, we conduct extensive experiments and find that layer selection of the teacher model is critical for distilling large-size models. Figure 1 gives an overview of our method.

[Displaced table note: the baseline of (Turc et al., 2019) is trained using the MLM objective, without knowledge distillation. We also report the results of truncated BERT BASE and truncated RoBERTa BASE, which drop the top 6 layers of the base model. Top-layer dropping has been proven to be a strong baseline (Sajjad et al., 2020). The fine-tuning results are an average of 4 runs.]

Multi-Head Self-Attention Relations
Multi-head self-attention relations are obtained by scaled dot-product of pairs of queries, keys and values of multiple relation heads. Taking query vectors as an example, in order to obtain queries of multiple relation heads, we first concatenate the queries of different attention heads and then split the concatenated vector based on the desired number of relation heads. The same operation is also performed on keys and values. For teacher and student models which use different numbers of attention heads, we convert their queries, keys and values into vectors of the same number of relation heads to perform KD training. Our method thus eliminates the restriction on the number of attention heads of student models. Moreover, using more relation heads in computing self-attention relations brings more fine-grained self-attention knowledge and improves the performance of the student model.

We use A 1 , A 2 , A 3 to denote the queries, keys and values of multiple relation heads. There are nine types of self-attention relations between these pairs, such as query-query, key-key, key-value and query-value relations. The KL-divergence between the multi-head self-attention relations of the teacher and student is used as the training objective:

$$\mathbf{R}^{T}_{ij,l,a} = \mathrm{softmax}\!\left(\frac{\mathbf{A}^{T}_{i,l,a}\,\mathbf{A}^{T\,\top}_{j,l,a}}{\sqrt{d_r}}\right),\qquad \mathbf{R}^{S}_{ij,m,a} = \mathrm{softmax}\!\left(\frac{\mathbf{A}^{S}_{i,m,a}\,\mathbf{A}^{S\,\top}_{j,m,a}}{\sqrt{d'_r}}\right)$$

$$\mathcal{L} = \sum_{ij} \alpha_{ij}\,\frac{1}{A_r|x|}\sum_{a=1}^{A_r}\sum_{t=1}^{|x|} D_{KL}\!\left(\mathbf{R}^{T}_{ij,l,a,t}\,\big\|\,\mathbf{R}^{S}_{ij,m,a,t}\right)$$

where A T i,l,a ∈ R |x|×d r and A S i,m,a ∈ R |x|×d' r (i ∈ [1, 3]) are the queries, keys and values of a relation head of the l-th teacher layer and the m-th student layer. d r and d' r are the relation head sizes of the teacher and student models. R T ij,l ∈ R A r ×|x|×|x| is the self-attention relation of A T i,l and A T j,l of the teacher model; R S ij,m ∈ R A r ×|x|×|x| is the corresponding self-attention relation of the student model. For example, R T 11,l represents the teacher's Q-Q relation in Figure 1. A r is the number of relation heads. α ij ∈ {0, 1} is the weight assigned to each self-attention relation loss. We find that only using the query-query, key-key and value-value relations achieves competitive performance.

[Footnote: in addition to task-agnostic distillation, TinyBERT uses task-specific distillation and data augmentation to further improve the model. We report the fine-tuning results of their public task-agnostic model.]
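A minimal NumPy sketch of the relation distillation objective, restricted to the Q-Q, K-K and V-V relations (i.e., α ij = 1 only for i = j). Function names are illustrative, not the authors' code:

```python
import numpy as np

def softmax(x):
    """Stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relations(A):
    """Self-attention relations R = softmax(A A^T / sqrt(d_r)).

    A: (A_r, |x|, d_r) relation-head vectors (queries, keys, or values).
    Returns (A_r, |x|, |x|) row-stochastic relation matrices.
    """
    d_r = A.shape[-1]
    return softmax(A @ A.transpose(0, 2, 1) / np.sqrt(d_r))

def relation_kd_loss(teacher_A, student_A):
    """KL(R_T || R_S) averaged over relation heads and positions.

    teacher_A, student_A: dicts {1: queries, 2: keys, 3: values}, each of
    shape (A_r, |x|, d_r). Teacher and student may use different d_r, but
    must share A_r and |x| so the relation matrices are comparable.
    """
    loss = 0.0
    for i in (1, 2, 3):                       # Q-Q, K-K, V-V relations only
        R_T = relations(teacher_A[i])
        R_S = relations(student_A[i])
        # sum KL over the last axis, mean over heads and positions
        loss += np.mean(np.sum(R_T * (np.log(R_T) - np.log(R_S)), axis=-1))
    return loss
```

Note that the relation matrices are always `|x| × |x|` regardless of head size, which is what removes the head-count restriction.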

Layer Selection of Teacher Model
Besides the knowledge used for distillation, the mapping function between teacher and student layers is also important. As in MINILM, we only transfer the self-attention knowledge of one teacher layer to the student's last layer. Different from previous work, which usually conducts experiments on base-size models, we experiment with different large-size teachers and find that transferring the self-attention knowledge of an upper middle layer performs better than using other layers. For BERT LARGE and BERT LARGE-WWM , transferring the 21st layer (counting from one) achieves the best performance. For RoBERTa LARGE , using the self-attention knowledge of the 19th layer achieves better performance. For base-size models, experiments indicate that using the teacher's last layer performs better than other layers.

Experiments
We conduct distillation experiments on different teacher models including BERT BASE , BERT LARGE , BERT LARGE-WWM and RoBERTa LARGE . We use multi-head query-query, key-key and value-value relations to perform KD training.

Setup
We use the uncased versions of the three BERT teacher models (BERT BASE , BERT LARGE and BERT LARGE-WWM ). The student models are trained on English Wikipedia and BookCorpus (Zhu et al., 2015), and we follow the preprocessing and the WordPiece tokenization of Devlin et al. (2018). We train student models using 256 as the batch size and 6e-4 as the peak learning rate for 400,000 steps. We use linear warmup over the first 4,000 steps and linear decay. We use Adam (Kingma and Ba, 2015) with β 1 = 0.9, β 2 = 0.999. The maximum sequence length is set to 512. The dropout rate and weight decay are 0.1 and 0.01. The number of attention heads is 12 for all student models. For BERT LARGE and BERT LARGE-WWM , we use the self-attention knowledge of the 21st layer to train the student model. The number of relation heads is 48 and 64 for base-size and large-size teacher models, respectively. The student models are initialized randomly.
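The learning-rate schedule stated above (linear warmup over the first 4,000 steps to a 6e-4 peak, then linear decay over 400,000 total steps) can be sketched as follows; `lr_at` is an illustrative helper, not from the released code:

```python
def lr_at(step, peak_lr=6e-4, warmup_steps=4_000, total_steps=400_000):
    """Learning rate at a given optimizer step: linear warmup, linear decay."""
    if step < warmup_steps:
        # ramp from 0 up to peak_lr over the warmup window
        return peak_lr * step / warmup_steps
    # decay linearly from peak_lr down to 0 at total_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

In a training loop this value would be assigned to the Adam optimizer's learning rate before each update.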
For RoBERTa LARGE , we use pre-training datasets similar to those of RoBERTa, which include 160GB of text from English Wikipedia, BookCorpus (Zhu et al., 2015), OpenWebText, CC-News, and Stories (Trinh and Le, 2018). We use the self-attention knowledge of the teacher's 19th layer for training the small models. For the 12×768 (base-size) student model, we use Adam with β 1 = 0.9, β 2 = 0.98. The remaining hyperparameters are the same as for models distilled from BERT. We conduct distillation experiments using 8 V100 GPUs with mixed-precision training.

Downstream Tasks
Following previous pre-training (Devlin et al., 2018) and task-agnostic distillation (Sun et al., 2019b; Jiao et al., 2019) work, we evaluate the models on the GLUE benchmark and extractive question answering.

Extractive Question Answering
The task aims to predict a continuous sub-span of the passage to answer the question. We evaluate on SQuAD 2.0 (Rajpurkar et al., 2018), which has served as a major question answering benchmark. We pack the question and passage tokens together with special tokens to form the input.

(3) Student models distilled from RoBERTa LARGE achieve further improvements: a better teacher results in better students, and self-attention relation distillation is effective for different large-size pretrained Transformers.

MobileBERT compresses a specially designed teacher model (of BERT LARGE size) with inverted-bottleneck modules into a 24-layer student using bottleneck modules. To compare with MobileBERT, we use a public large-size model (BERT LARGE-WWM ) as the teacher, which achieves similar performance to the teacher of MobileBERT. We distill BERT LARGE-WWM into a student model containing the same number of parameters (25M parameters, 12×384 with 128 embedding size), using the same training data (English Wikipedia and BookCorpus). The test results on the GLUE benchmark and the dev results on SQuAD 2.0 are shown in Table 3. MINILMv2 outperforms MobileBERT across most tasks with faster inference speed. Moreover, our method can be applied to different teachers and imposes far fewer restrictions on student models.

Main Results
We compress RoBERTa LARGE and BERT LARGE into base-size student models. Dev results are presented in Table 4. Our base-size models distilled from large-size teachers outperform BERT BASE and RoBERTa BASE . Our method can thus also be employed to train a base-size model. Moreover, MINILMv2 distilled from RoBERTa LARGE uses a much smaller (almost 32× smaller) training batch size and fewer training steps than RoBERTa BASE . Our method uses far fewer training examples and has a lower computation cost.

Ablation Studies
Effect of distilling different teacher layers. Figures 2 and 3 present the results of the 6×384 model distilled from different layers of BERT BASE and BERT LARGE . For BERT BASE , using the last layer achieves better performance than other layers. For BERT LARGE , we find that using one of the upper middle layers achieves the best performance. The same trend is also observed for BERT LARGE-WWM and RoBERTa LARGE .

Effect of the number of relation heads. Table 5 shows the results of the 6×384 model distilled from BERT BASE using different numbers of relation heads. Using a larger number of relation heads achieves better performance: more fine-grained self-attention knowledge can be captured, which helps the student to deeply mimic the self-attention module of its teacher.

Conclusion
In this work, we present a simple and effective approach for compressing pretrained Transformers. We employ multi-head self-attention relations to train the student to deeply mimic the self-attention module of its teacher. Our method eliminates the restriction that the number of the student's attention heads must match its teacher's, as required by previous work transferring self-attention distributions. Moreover, we show that transferring the self-attention knowledge of an upper middle layer achieves better performance for large-size teacher models. Our student models distilled from BERT and RoBERTa obtain competitive performance on SQuAD 2.0 and the GLUE benchmark, and outperform state-of-the-art methods. For future work, we are exploring an automatic layer selection algorithm. We also would like to apply our method to larger pretrained Transformers.