AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models

Pre-trained language models (PLMs) have achieved great success in natural language processing. Most PLMs follow the default setting of architecture hyper-parameters in BERT (e.g., the hidden dimension is a quarter of the intermediate dimension in the feed-forward sub-networks). Few studies have explored the design of architecture hyper-parameters in BERT, especially for the more efficient PLMs with tiny sizes, which are essential for practical deployment on resource-constrained devices. In this paper, we adopt one-shot Neural Architecture Search (NAS) to automatically search architecture hyper-parameters. Specifically, we carefully design the one-shot learning techniques and the search space to provide an adaptive and efficient way to develop tiny PLMs for various latency constraints. We name our method AutoTinyBERT and evaluate its effectiveness on the GLUE and SQuAD benchmarks. The extensive experiments show that our method outperforms both the SOTA search-based baseline (NAS-BERT) and the SOTA distillation-based methods (such as DistilBERT, TinyBERT, MiniLM, and MobileBERT). In addition, based on the obtained architectures, we propose a more efficient development method that is even faster than the development of a single PLM. The source code and models will be publicly available upon publication.

Most PLMs (e.g., GPT-3 (Brown et al., 2020)) follow the default rule of hyper-parameter setting 2 in BERT to scale up their model sizes. Due to its simplicity, this rule has been widely used and helps large PLMs obtain promising results (Brown et al., 2020). In many industrial scenarios, we need to deploy PLMs on resource-constrained devices, such as smartphones and servers with limited computation power. Due to the expensive computation and slow inference speed, it is usually difficult to deploy PLMs such as BERT (12/24 layers, 110M/340M parameters) and GPT-2 (48 layers, 1.5B parameters) at their original scales. Therefore, there is an urgent need to develop PLMs with smaller sizes, which have lower computation cost and inference latency. In this work, we focus on a specific type of efficient PLMs, which we define to have an inference time less than 1/4 that of BERT base. Although there have been quite a few works using knowledge distillation to build small PLMs (Sanh et al., 2019; Jiao et al., 2020b; Sun et al., 2019, 2020), all of them focus on the application of distillation techniques (Hinton et al., 2015; Romero et al., 2014) and do not study the effect of architecture hyper-parameter settings on model performance. Recently, neural architecture search and hyper-parameter optimization (Tan and Le, 2019; Han et al., 2020) have been widely explored in machine learning, mostly in computer vision, and have been proven to find better designs than heuristic ones. Inspired by this research, one question naturally arises: can we find better settings of hyper-parameters 4 for efficient PLMs?
In this paper, we argue that the conventional hyper-parameter setting is not optimal for efficient PLMs (as shown in Figure 1) and introduce a method to automatically search for the optimal hyper-parameters under specific latency constraints. Pre-training efficient PLMs is inevitably resource-consuming (Turc et al., 2019). Therefore, it is infeasible to directly evaluate millions of architectures. To tackle this challenge, we introduce one-shot Neural Architecture Search (NAS) (Brock et al., 2018; Cai et al., 2018; Yu et al., 2020) to perform automatic hyper-parameter optimization on efficient PLMs, and name the method AutoTinyBERT. Specifically, we first use one-shot learning to obtain a big SuperPLM, which can act as a proxy for all potential sub-architectures. Proxy means that when evaluating an architecture, we only need to extract the corresponding sub-model from the SuperPLM, instead of training the model from scratch. The SuperPLM helps avoid the time-consuming pre-training process and makes the search process efficient. To make the SuperPLM more effective, we propose practical techniques including head sub-matrix extraction and efficient batch-wise training, and in particular limit the search space to models with an identical layer structure. Furthermore, using the SuperPLM, we leverage a search algorithm (Xie and Yuille, 2017; Wang et al., 2020a) to find hyper-parameters for various latency constraints.
In the experiments, in addition to the pre-training setting (Devlin et al., 2019), we also consider the setting of task-agnostic BERT distillation (Sun et al., 2020), which pre-trains with a knowledge distillation loss, to build efficient PLMs. Extensive results show that in the pre-training setting, AutoTinyBERT not only consistently outperforms BERT with conventional hyper-parameters under different latency constraints, but also outperforms NAS-BERT, which is based on neural architecture search. In task-agnostic BERT distillation, AutoTinyBERT outperforms a series of existing SOTA methods: DistilBERT, TinyBERT and MobileBERT.
Our contributions are three-fold: (1) we explore the problem of how to design hyper-parameters for efficient PLMs and introduce an effective and efficient method, AutoTinyBERT; (2) we conduct extensive experiments in both the pre-training and knowledge distillation scenarios, and the results show that our method consistently outperforms baselines under different latency constraints; (3) we summarize a fast rule that can develop an AutoTinyBERT model for a specific constraint with only about 50% of the training time of a conventional PLM.

Preliminary
Before presenting our method, we first provide some details about the Transformer layer (Vaswani et al., 2017) to introduce the conventional hyper-parameter setting. The Transformer layer includes two sub-structures: the multi-head attention (MHA) and the feed-forward network (FFN).
For clarity, we show the MHA as a decomposable structure, where the MHA includes h individual and parallel self-attention modules (called heads). The output of MHA is obtained by summing the outputs of all heads. Specifically, each head is represented by four main matrices and takes the hidden states 5 H ∈ R^{l×d_m} of the previous layer as input. The output of MHA is given by the following formulas:

Q_i = HW_i^q, K_i = HW_i^k, V_i = HW_i^v,
Head_i = softmax(Q_i K_i^T / √(d_q/h)) V_i,

where W_i^q ∈ R^{d_m×(d_q/h)}, W_i^k ∈ R^{d_m×(d_k/h)}, W_i^v ∈ R^{d_m×(d_v/h)}, and softmax(·) denotes the
Figure 2: Overview of AutoTinyBERT. We first train an effective SuperPLM with one-shot learning, where the objectives of pre-training or task-agnostic BERT distillation are used. Then, given a specific latency constraint, we perform an evolutionary algorithm on the SuperPLM to search for optimal architectures. Finally, we extract the corresponding sub-models based on the optimal architectures and further train these models.
scaled dot-product attention operation. The output of each head is then transformed by an output matrix W_i^o ∈ R^{(d_v/h)×d_o}:

O_i = Head_i W_i^o,   MHA(H) = Σ_{i=1}^{h} O_i.

In the conventional setting of the hyper-parameters in BERT, all dimensions of the matrices are the same as the dimension of the hidden vector, namely d_q|k|v|o = d_m. In fact, only the two requirements d_q = d_k and d_o = d_m must be satisfied, because of the dot-product attention operation in MHA and the residual connection.
The Transformer layer also contains an FFN that is stacked on the MHA:

FFN(X) = max(0, XW_1 + b_1) W_2 + b_2,

where W_1 ∈ R^{d_m×d_f} and W_2 ∈ R^{d_f×d_m}. Similarly, there are residual connection and layer normalization modules on top of the FFN. In the original Transformer, d_f = 4d_m is assumed. Thus, we conclude that the conventional hyper-parameter setting follows the rule {d_q|k|v|o = d_m, d_f = 4d_m}.
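To make the rule concrete, the per-layer parameter count can be written as a function of the decoupled dimensions. The following sketch is a hypothetical helper (biases and layer-norm parameters omitted) showing how the conventional rule {d_q|k|v|o = d_m, d_f = 4d_m} fixes the count once d_m is chosen:

```python
def transformer_layer_params(d_m, d_qk, d_v, d_o, d_f):
    """Rough parameter count of one Transformer layer (biases omitted),
    with the attention dimensions decoupled from the hidden size d_m."""
    # MHA: query and key projections (d_m x d_qk each), value projection
    # (d_m x d_v), and the output projection back to the residual stream
    # (d_v x d_o, with d_o = d_m required by the residual connection).
    mha = d_m * d_qk * 2 + d_m * d_v + d_v * d_o
    # FFN: two linear maps d_m -> d_f -> d_m.
    ffn = d_m * d_f + d_f * d_m
    return mha + ffn

# Conventional BERT rule: d_q|k|v|o = d_m and d_f = 4 * d_m.
conventional = transformer_layer_params(768, 768, 768, 768, 4 * 768)
```

Decoupling these dimensions is exactly what enlarges the design space that AutoTinyBERT searches over.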

Problem Statement
Given a constraint on inference time, our goal is to find an optimal configuration of architecture hyper-parameters α_opt such that a PLM built with it achieves the best performance on downstream tasks. This optimization problem is formulated as:

α_opt = argmin_{α ∈ A} L_α(θ),  s.t. Lat(α) ≤ T,

where T is a specific time constraint, A refers to the set of all possible architectures (i.e., combinations of hyper-parameters), Lat(·) is a latency evaluator, L_α(·) denotes the loss function of the PLM with hyper-parameters α, and θ is the corresponding model parameters. We aim to search for an optimal architecture for efficient PLMs (Lat(α) < 1/4 × Lat(BERT base)).

Overview
A straightforward way to get the optimal architecture is to enumerate all possible architectures. However, this is infeasible because each trial involves a time-consuming pre-training process. Therefore, we introduce one-shot NAS to search for α_opt, as shown in Figure 2. The proposed method includes three stages: (1) one-shot learning to obtain a SuperPLM that serves as the proxy for (2) the search process for the optimal hyper-parameters; and (3) further training with the optimal architectures and the corresponding sub-models. In the following sections, we first introduce the search space, which is the basis for both the one-shot learning and the search process. Then we present the three stages respectively.

Search Space
From Section 2, we know that the conventional hyper-parameter setting {d_q|k|v|o = d_m, d_f = 4d_m} is widely used in PLMs. Let l_t denote the layer number and d_* refer to the different dimensions in the Transformer layer. The architecture of a PLM is parameterized as:

α = (l_t, d_m, d_q|k|v|o, d_f).

We denote the search spaces of l_t and d_* as A_{l_t} and A_{d_*} respectively. The overall search space is:

A = A_{l_t} × A_{d_m} × A_{d_q|k|v|o} × A_{d_f}.

In this work, we only consider the case of an identical structure for each Transformer layer, instead of non-identical Transformer layers (Wang et al., 2020a) or other heterogeneous modules (Xu et al., 2021) (such as convolution units). This has two advantages: (1) it reduces an exponential search space to a linear one, greatly reducing the number of possible architectures in SuperPLM training and the exploration space in the search process, which leads to a more efficient search; (2) an identical and homogeneous structure is in fact more friendly to hardware and software frameworks, e.g., Hugging Face Transformers (Wolf et al., 2020). With a few changes, we can use the original code to run AutoTinyBERT, as shown in Appendix A.
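The advantage in (1) can be made concrete by counting configurations. The sketch below uses hypothetical candidate grids (the paper's actual grids are listed in Appendix B) to compare the identical-layer space with a per-layer, non-identical space:

```python
# Hypothetical candidate grids; the paper's actual grids are in Appendix B.
layer_nums = [2, 3, 4, 5, 6, 7, 8]
d_model    = list(range(128, 769, 64))    # d_m candidates
d_qkv      = list(range(128, 769, 64))    # d_q|k|v|o candidates
d_ffn      = list(range(128, 3073, 128))  # d_f candidates

# Identical-layer space: one (d_m, d_qkv, d_f) choice shared by all layers,
# so the size grows linearly with the number of candidate values.
identical = len(layer_nums) * len(d_model) * len(d_qkv) * len(d_ffn)

# Non-identical space: an independent choice per layer, which is
# exponential in the depth l_t.
per_layer = len(d_model) * len(d_qkv) * len(d_ffn)
non_identical = sum(per_layer ** l for l in layer_nums)
```

Even with these modest grids, the non-identical space is many orders of magnitude larger than the identical-layer one, which is why restricting to identical layers makes both SuperPLM training and the search tractable.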

One-shot Learning for SuperPLM
We employ one-shot learning (Brock et al., 2018; Yu et al., 2020) to obtain a SuperPLM whose sub-models can act as proxies for PLMs trained from scratch. The configuration of the SuperPLM in this work is l_t = 8, d_m|q|k|v|o = 768 and d_f = 3072. In each step of one-shot learning, we train several sub-models randomly sampled from the SuperPLM to make their performance close to that of models trained from scratch. Although the sampling/search space has been reduced to linear complexity, there are still more than 10M possible sub-structures in the SuperPLM (details are shown in Appendix B). Therefore, we introduce an effective batch-wise training method to cover the sub-models as much as possible. Specifically, in parallel training, we first divide each batch into multiple sub-batches and distribute them to different threads as parallel training data. Then, we sample several sub-models on each thread for training and merge the gradients of all threads to update the SuperPLM parameters. We illustrate the training process in Algorithm 1.
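A minimal sketch of the batch-wise sampling in Algorithm 1 might look as follows. The grids, function names, and the returned "plan" are illustrative only; a real implementation would additionally run forward/backward passes of each sampled sub-model on its sub-batch and merge the gradients across threads to update the SuperPLM:

```python
import random

SPACE = {  # hypothetical sampling grids for sub-models of the SuperPLM
    "l_t":   [2, 3, 4, 5, 6, 7, 8],
    "d_m":   list(range(128, 769, 64)),
    "d_qkv": list(range(128, 769, 64)),
    "d_f":   list(range(128, 3073, 128)),
}

def sample_submodel(rng):
    """Draw one sub-architecture uniformly from the (linear) space."""
    return {k: rng.choice(v) for k, v in SPACE.items()}

def batchwise_step(batch, n_threads=16, m_samples=3, seed=0):
    """One step of the batch-wise training (sketch): split the batch into
    sub-batches for the threads and sample M sub-models per thread."""
    rng = random.Random(seed)
    sub_batches = [batch[i::n_threads] for i in range(n_threads)]
    # Each thread gets its own data shard and its own set of sampled
    # sub-architectures; gradients would be merged after all threads run.
    return [(sb, [sample_submodel(rng) for _ in range(m_samples)])
            for sb in sub_batches]
```

With N = 16 threads and M = 3 samples per thread, each training step touches up to 48 distinct sub-architectures, which is how the method covers a large space despite a fixed compute budget.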
Given a specific hyper-parameter setting α = (l_t, d_m, d_q|k|v|o, d_f), we get a sub-model from the SuperPLM by depth-wise and width-wise extraction. Specifically, we first perform the depth-wise extraction that takes the first l_t Transformer layers from the SuperPLM, and then perform the width-wise extraction that takes the bottom-left sub-matrices from the original matrices. For MHA, we apply the two strategies illustrated in Figure 3: (1) keep the dimension of each head the same as in the SuperPLM and extract a subset of the heads; (2) keep the head number the same as in the SuperPLM and extract sub-dimensions from each head. The first strategy is the standard one and we use it for pre-training; the second strategy is used for task-agnostic distillation, because attention-based distillation (Jiao et al., 2020b) requires the student model to have the same head number as the teacher model.

Table 1: We report the average score excluding SQuAD (Score) in addition to the average score of all tasks (Avg.). The speedup is relative to the BERT base inference speed and evaluated on a single CPU with a single input of length 128. In PF-xLyD, x and y refer to the layer number and hidden dimension respectively. † denotes that the results are taken from (Xu et al., 2021) and ‡ denotes that the results are obtained by fine-tuning the released models.
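The width-wise extraction can be sketched with plain array slicing. The helper names below are hypothetical, and we assume "bottom-left" corresponds to taking the leading rows and columns of each weight matrix:

```python
import numpy as np

def extract_width(w, rows, cols):
    """Width-wise extraction: take the leading rows/cols sub-matrix
    (the 'bottom-left' block under the paper's layout convention)."""
    return w[:rows, :cols]

def extract_heads_strategy1(w_q, head_dim, keep_heads):
    """Strategy 1 (pre-training): keep the per-head dimension, drop heads.
    w_q has shape (d_m, h * head_dim); keep the first `keep_heads` heads."""
    return w_q[:, : keep_heads * head_dim]

def extract_heads_strategy2(w_q, num_heads, new_head_dim):
    """Strategy 2 (distillation): keep all heads, shrink each head's
    dimension so the student matches the teacher's head count."""
    d_m, total = w_q.shape
    old_head_dim = total // num_heads
    w = w_q.reshape(d_m, num_heads, old_head_dim)
    return w[:, :, :new_head_dim].reshape(d_m, num_heads * new_head_dim)
```

Strategy 1 removes whole contiguous head blocks, while strategy 2 interleaves the kept columns, taking the leading sub-dimensions within every head.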

Search Process
In the search process, we adopt an evolutionary algorithm (Xie and Yuille, 2017; Jiao et al., 2020a), where an Evolver and an Evaluator interact with each other to evolve better architectures. Our search process is efficient, as shown in Section 4.4. Specifically, the Evolver first samples a generation of architectures from A. Then the Evaluator extracts the corresponding sub-models from the SuperPLM and ranks them based on their performance on the SQuAD and MNLI tasks. The architectures with the highest performance are chosen as the winning architectures, and the Evolver performs the mutation operation Mut(·) on the winners to produce a new generation of architectures. This process is repeated. Finally, we choose several architectures with the best performance for further training. We use Lat(·) to predict the latency of candidates and filter out those that do not meet the latency constraint. Lat(·) is built with the method of Wang et al. (2020a), which first samples about 10k architectures from A, collects their inference time on target devices, and then fits the data with a feed-forward network. For more details of the evolutionary algorithm, please refer to Appendix C. Note that different methods can be used in the search process, such as random search or more advanced algorithms, which we leave as future work.
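A simplified version of the Evolver/Evaluator loop might look like the following. The function names are illustrative: in the real system, `evaluate` scores sub-models extracted from the SuperPLM on SQuAD and MNLI, and `latency` is the learned feed-forward latency predictor Lat(·):

```python
import random

def evolve(space, evaluate, latency, budget, rng,
           generations=4, pop_size=25, p_mut=0.5):
    """Sketch of the Evolver/Evaluator interaction: sample a generation,
    keep the winners that fit the latency budget, and mutate them to
    form the next generation."""
    sample = lambda: {k: rng.choice(v) for k, v in space.items()}

    def mutate(a):
        child = dict(a)
        key = rng.choice(list(space))       # mutate one hyper-parameter
        child[key] = rng.choice(space[key])
        return child

    pop, best = [sample() for _ in range(pop_size)], None
    for _ in range(generations):
        # Filter out candidates that violate the latency constraint.
        valid = [a for a in pop if latency(a) <= budget]
        ranked = sorted(valid, key=evaluate, reverse=True)
        if ranked and (best is None or evaluate(ranked[0]) > evaluate(best)):
            best = ranked[0]
        # Winners seed the next generation via mutation; the rest are
        # fresh samples to keep exploring the space.
        winners = ranked[: max(1, pop_size // 5)] or [sample()]
        pop = [mutate(rng.choice(winners)) if rng.random() < p_mut
               else sample() for _ in range(pop_size)]
    return best
```

The toy proxy below treats "bigger is better" as the score, so the loop should converge toward the largest architecture that still fits the latency budget.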

Further Training
The search process produces the top several architectures, with which we extract the corresponding sub-models from the SuperPLM and continue training them with the pre-training or KD objectives.

Experimental Setup
Dataset and Fine-tuning. We conduct the experiments on the GLUE benchmark (Wang et al., 2018) and SQuADv1.1 (Rajpurkar et al., 2016). For GLUE, we set the batch size to 32, choose the learning rate from {1e-5, 2e-5, 3e-5} and the epoch number from {4, 5, 10}. For SQuADv1.1, we set the batch size to 16, the learning rate to 3e-5 and the epoch number to 4. Details for all datasets are displayed in Appendix D.
AutoTinyBERT. Both the one-shot and further training use BooksCorpus (Zhu et al., 2015) and English Wikipedia as training data. The settings for one-shot training are: peak learning rate of 1e-5, warmup rate of 0.1, batch size of 256 and 5 running epochs. Further training follows the same settings as the one-shot training, except for a warmup rate of 0. In the batch-wise training of Algorithm 1, the thread number N is set to 16, the number of samples M per batch is set to 3, and the epoch number E is set to 5. We train the SuperPLM with an architecture of {l_t = 8, d_m|q|k|v|o = 768, d_f = 3072}. In the search process, the Evolver performs 4 iterations with a population size of 25 and chooses the top three architectures for further training. For more details of the sampling/search space and the evolutionary algorithm, please refer to Appendix B and C.

Table 2: Comparison between AutoTinyBERT and baselines based on knowledge distillation (dev results on GLUE and SQuAD). ‡ denotes that the results are taken from (Sun et al., 2020) and † means the models are trained using the released code or the reimplemented code with ELECTRA base as the teacher model. ¶ means these tasks use accuracy on the dev set and F1 on the test set respectively. § denotes the task-agnostic TinyBERT without task-specific distillation. * means that the speedup differs from (Sun et al., 2020) because it is evaluated on a Pixel phone rather than on server CPUs. A dash means that the results are missing in the original paper. For other information refer to Table 1.
We train AutoTinyBERT in both ways: pre-training (Devlin et al., 2019) and task-agnostic BERT distillation (Sun et al., 2020). For task-agnostic distillation, we follow the first stage of TinyBERT (Jiao et al., 2020b), except that only the last-layer loss is used, and ELECTRA base (Clark et al., 2019) is used as the teacher model.
Baselines. For the pre-training baselines, we include PF (Pre-training + Fine-tuning, proposed by Turc et al. (2019)), BERT-S* (BERT under several hyper-parameter configurations), and NAS-BERT (Xu et al., 2021). Both PF and BERT-S* follow the conventional hyper-parameter setting rule. BERT-S* uses the training settings: peak learning rate of 1e-5, warmup rate of 0.1, batch size of 256 and 10 running epochs. NAS-BERT searches architectures built on non-identical layers and heterogeneous modules. For the distillation baselines, we compare several typical methods, including DistilBERT, BERT-PKD, TinyBERT, MiniLM, and MobileBERT. The first four methods use the conventional architectures. MobileBERT is equipped with a bottleneck structure and a carefully designed balance between MHA and FFN. We also consider BERT-KD-S*, which uses the same training settings as BERT-S*, except for the knowledge distillation loss. BERT-KD-S* also uses ELECTRA base as the teacher model.

Results and Analysis
The experiments are conducted under different latency constraints, from 4× to 30× faster than the inference of BERT base. The results of pre-training and task-agnostic distillation are shown in Table 1 and Table 2 respectively.
We observe that in both the pre-training and knowledge distillation settings, the performance gap between different models with similar inference time is obvious, which shows the necessity of architecture optimization for efficient PLMs. In Table 1, the observations are: (1) the architecture optimization methods AutoTinyBERT and NAS-BERT outperform both BERT and PF, which use the default architecture hyper-parameters; (2) our method outperforms NAS-BERT, which is built with non-identical layers and heterogeneous modules, showing that the proposed method is effective for the architecture search of efficient PLMs. In Table 2, we observe that: (1) our method consistently outperforms the conventional structure under all the speedup constraints; (2) our method outperforms the classical distillation methods (e.g., BERT-PKD, DistilBERT, TinyBERT, and MiniLM) that use the conventional architecture. Moreover, AutoTinyBERT achieves comparable results with MobileBERT, with a 1.5× faster inference speed.

Ablation Study of One-shot Learning
We demonstrate the effectiveness of one-shot learning by comparing the performance of one-shot models and stand-alone trained models on given architectures. We choose 16 architectures and their corresponding PF models 6 as the evaluation benchmark. Pairwise accuracy is used as a metric to indicate the ranking correlation between the architectures under one-shot training and those under stand-alone full training (Luo et al., 2019); its formula is described in Appendix E. We conduct an ablation study to analyze the effect of the proposed identical layer structure (ILS), MHA sub-matrix extraction (SME) and effective batch-wise learning (EBL) on SuperPLM learning. Moreover, we introduce HAT (Wang et al., 2020a) as a baseline of one-shot learning. HAT focuses on the search space of non-identical layer structures. The results are displayed in Table 3 and Figure 4.

6 The first 16 models, from 2L128D to 8L768D, at https://github.com/google-research/bert.
The figure shows that, compared with stand-alone trained models, the HAT baseline has a significant performance gap, especially at small sizes. Both ILS and SME benefit one-shot learning for large and medium-sized models. When further combined with EBL, the SuperPLM obtains similar or even better results than stand-alone trained models of small sizes and performs close to stand-alone trained models of big sizes. The results in the table show that: (1) the proposed techniques have positive effects on SuperPLM learning, and EBL brings a significant improvement on the challenging SQuAD task; (2) the SuperPLM achieves a high pairwise accuracy of 96.7%, which indicates that it can be a good proxy model for the search process; (3) the performance of the SuperPLM is still slightly worse than that of stand-alone trained models, so further training is needed to boost performance.
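The pairwise accuracy metric (detailed in Appendix E) can be sketched as follows, assuming two score lists aligned by architecture index, one from the one-shot proxy and one from stand-alone training:

```python
from itertools import combinations

def pairwise_accuracy(one_shot_scores, stand_alone_scores):
    """Fraction of architecture pairs whose relative order under one-shot
    (proxy) evaluation matches their order under stand-alone full
    training (Luo et al., 2019)."""
    pairs = list(combinations(range(len(one_shot_scores)), 2))
    agree = sum(
        ((one_shot_scores[i] - one_shot_scores[j]) *
         (stand_alone_scores[i] - stand_alone_scores[j])) > 0
        for i, j in pairs
    )
    return agree / len(pairs)
```

A value of 1.0 means the proxy ranks every pair the same way as full training; for the search process, only this ranking agreement matters, not the absolute scores.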

Fast Development of Efficient PLM
In this section, we explore an effective hyper-parameter setting rule based on the obtained architectures, and also discuss the computation cost of developing efficient PLMs. The conventional and new architectures are displayed in Table 4. We observe that AutoTinyBERT follows an obvious rule (except for the S3 model) under the speedup constraints from 4× to 30×. The rule is summarized as: With the above rule, we propose a faster way to build efficient PLMs, denoted as AutoTinyBERT-Fast. Specifically, we first obtain candidates by the rule, and then select α_opt from the candidates. We observe that candidates with the same layer number have similar shapes, and we assume that they have similar performance. Therefore, we only need to test one architecture per layer number and choose the best one as α_opt.
To demonstrate the effectiveness of the proposed method, we evaluate these methods at a new speedup constraint of about 10× under the pre-training setting. The results are shown in Table 5. We find that AutoTinyBERT is efficient: its development time is only twice that of the conventional method (BERT), while the result is improved by about 1.8%. AutoTinyBERT-Fast achieves a competitive score of 77.6 with only about 50% of the BERT training time. In addition to the proposed search method and the fast building rule, one reason for the high efficiency of AutoTinyBERT is that initialization from the SuperPLM gives the model a 2× convergence speedup, as illustrated in Figure 5.

Related Work
Efficient PLMs with Tiny Sizes. There are two widely-used approaches to building efficient PLMs: pre-training and model compression. Knowledge distillation (KD) (Hinton et al., 2015; Romero et al., 2014) is the most widely studied technique in PLM compression, and it uses a teacher-student framework. Typical distillation studies include DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), MiniLM (Wang et al., 2020b), MobileBERT (Sun et al., 2020), MiniBERT (Tsai et al., 2019) and ETD (Chen et al., 2021). In addition to KD, the techniques of pruning (Han et al., 2016), quantization (Shen et al., 2020; Wang et al., 2020c) and parameter sharing (Lan et al., 2019) have been introduced for PLM compression. Our method is orthogonal to the building method of efficient PLMs; it is trained under the settings of pre-training and task-agnostic BERT distillation, and the resulting models can be used by direct fine-tuning.
NAS for NLP. NAS has been extensively studied in computer vision (Tan and Le, 2019; Tan et al., 2020), but relatively little in natural language processing. Evolved Transformer (So et al., 2019) and HAT (Wang et al., 2020a) search architectures for Transformer-based neural machine translation. For BERT distillation, AdaBERT focuses on searching the architecture in the fine-tuning stage and relies on data augmentation to improve its performance. schuBERT (Khetan and Karnin, 2020) obtains optimal structures of PLMs by a pruning method. The work most similar to ours is NAS-BERT (Xu et al., 2021), which proposes techniques to tackle the challenging exponential search space of non-identical layer structures and heterogeneous modules. Our method adopts a linear search space and introduces several practical techniques for SuperPLM training. Moreover, our method is efficient in terms of computation cost, and the obtained PLMs are easy to use.

Conclusion
We propose an effective and efficient method, AutoTinyBERT, to search for the optimal architecture hyper-parameters of efficient PLMs. We evaluate the proposed method in the scenarios of both pre-training and task-agnostic BERT distillation. Extensive experiments show that AutoTinyBERT consistently outperforms the baselines under different latency constraints. Furthermore, we derive a fast development rule for efficient PLMs, which can build an AutoTinyBERT model with even less training time than a conventional one.

A Code Modifications for AutoTinyBERT
We modify the original code 8 to load AutoTinyBERT models and present the details of the code modifications in Figure B.1. We assume that d_q/k = d_v; in the more complicated setting where d_v differs from d_q/k, corresponding changes can be made based on the given modifications.
B Search Space of Architecture Hyper-parameters.
We have trained two SuperPLMs with an architecture of {l_t = 8, d_m/q/k/v = 768, d_f = 3072} to cover the two scenarios of building efficient PLMs (pre-training and task-agnostic BERT distillation). The sampling space in the SuperPLM training is the same as the search space in the search process, as shown in the

C Evolutionary Algorithm.
We give a detailed description of the evolutionary algorithm in Algorithm 2.
D Hyper-parameters for Fine-Tuning.
Fine-tuning hyper-parameters of GLUE benchmark and SQuAD are displayed in

F More details for Fast Development of efficient PLM.
We present the detailed results and architecture hyper-parameters for the fast development of efficient PLMs in

Algorithm 2 The Evolutionary Algorithm
1: Input: the number of generations T = 4, the number of architectures α in each generation S = 25, the mutation probability p_m = 1/2, the exploration probability p_e = 1/2.
2: Sample the first generation G_1 from A, and the Evaluator produces its performance V_1.
3: for t = 2, 3, ..., T do
4:   while |G_t| < S do
5:     Sample one architecture α with a Russian-roulette process on G_{t-1} and V_{t-1}.
6:     With probability p_m, do Mut(α).
7:     With probability p_e, sample a new architecture from A instead.
8:     Append the newly generated architecture into G_t.
9:   end while
10:  The Evaluator obtains V_t for G_t.
11: end for
12: Output: the α_opt with the best performance in the above process.