LV-BERT: Exploiting Layer Variety for BERT

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space of hundreds of billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT models obtained by our method outperform BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE testing set, 1.8 higher than the strong baseline ELECTRA-small.


Introduction
In recent years, pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020). The backbone architectures of these models mostly adopt a stereotyped layer pattern, in which the self-attention and feed-forward layers are arrayed in an interleaved order (Vaswani et al., 2017). However, there is no evidence supporting that this layer pattern is optimal (Press et al., 2020). We then consider a straightforward and interesting question: could we change the layer pattern to improve pre-trained models?

1 https://github.com/yuweihao/LV-BERT

Figure 1 (caption, partial): Except BERT (Devlin et al., 2019), the other models are pre-trained with the Replaced Token Detection objective (Clark et al., 2020) to save computation cost.
We attempt to answer this question by exploiting more layer variety from two aspects, as shown in Figure 1(a): the layer type set and the layer order.
We first consider the layer types. In previous pre-trained language models, the most widely-used layer set contains the self-attention layer for capturing global information and the feed-forward layer for non-linear transformation. However, some recent works have unveiled that some self-attention heads in pre-trained models tend to learn local dependencies due to the inherent property of natural language (Kovaleva et al., 2019; Brunner et al., 2020), incurring computation redundancy for capturing local information. In contrast, convolution is a local operator (LeCun et al., 1998; Krizhevsky et al., 2012; Simonyan and Zisserman, 2015; He et al., 2016) and has shown effectiveness in extracting local information for language models (Zeng et al., 2014; Kim, 2014; Kalchbrenner et al., 2014; Wu et al., 2018, 2019b). Thus, we propose to augment the layer set by including convolution for local information extraction.

For layer orders, most existing pre-trained models adopt an interleaved order to arrange the different types of layers. Differently, Press et al. (2020) presented the sandwich order, i.e., stacking consecutive self-attention and feed-forward layers at the bottom and top, respectively, while keeping the interleaved order in the middle. The sandwich order has been shown to bring improvement on the language modeling task, indicating that the layer order contributes to model performance. However, Press et al. (2020) did not show the generalization capability of this order to other tasks, so there is still large room for exploring more effective orders for pre-trained models. We show the different layer variety designs of existing models in Figure 1(b), including BERT (Devlin et al., 2019)/ELECTRA (Clark et al., 2020), DynamicConv (Wu et al., 2018) and Sandwich (Press et al., 2020). Their performance is summarized in Figure 1(c). It can be seen that layer variety significantly influences model performance.
We thus claim it is necessary to investigate layer variety for promoting pre-trained models. However, to perform such an investigation for a common model backbone, e.g., one with 24 layers, we would need to evaluate the performance of every candidate within an architecture space of 3^24 ≈ 2.8 × 10^11 candidates. Pre-training a single language model already consumes a large amount of computation, e.g., 2400 P100 GPU days for pre-training BERT. It is barely affordable to pre-train such a large number of candidate models from scratch. To reduce the computation cost, inspired by recent works on Neural Architecture Search (NAS) (Guo et al., 2020; Cai et al., 2019), we construct a supernet according to the layer variety discussed above and pre-train it with the Masked Language Modeling (MLM) (Devlin et al., 2019) objective. After obtaining the pre-trained supernet, we develop an evolutionary algorithm guided by MLM evaluation accuracy to search for an effective architecture with specific layer variety. We call the resulting model LV-BERT. Extensive experiments show that LV-BERT outperforms BERT and its variants.

The contributions of our paper are two-fold. Firstly, to the best of our knowledge, this work is the first to exploit layer variety w.r.t. both layer types and orders for pre-trained language models. We find that convolutions and layer orders both benefit pre-trained model performance, and we hope our observations will facilitate the development of pre-trained language models. Secondly, our obtained LV-BERT shows superiority over BERT and its variants. For example, LV-BERT-small achieves 79.8 on the GLUE testing set, 1.8 higher than the baseline ELECTRA-small (Clark et al., 2020).

Related Work
Pre-trained Language Models Pre-trained language models have achieved great success and promoted the development of NLP techniques. Instead of separate word representations (Mikolov et al., 2013a,b), McCann et al. (2017) and Peters et al. (2018b) propose CoVe and ELMo respectively, which both utilize LSTMs (Hochreiter and Schmidhuber, 1997) to generate contextualized word representations. Later, Radford et al. (2018) introduce GPT, which changes the backbone to transformers where self-attention and feed-forward layers are arrayed interleavedly, and propose generative pre-training objectives. BERT (Devlin et al., 2019) continues to use the same layer set and order for the backbone but employs different pre-training objectives, i.e., Masked Language Modeling and Next Sentence Prediction. More works then introduce new effective pre-training objectives, like Generalized Autoregressive Pretraining (Yang et al., 2019), the Span Boundary Objective (Joshi et al., 2020) and Replaced Token Detection (Clark et al., 2020). Besides designing pre-training objectives, some other works extend BERT by incorporating knowledge (Zhang et al., 2019; Peters et al., 2019; Xiong et al., 2020) or multiple languages (Conneau and Lample, 2019; Chi et al., 2019). All these works utilize the stereotyped layer pattern, which is not necessarily optimal (Press et al., 2020), inspiring us to further investigate layer variety to improve pre-trained models. To the best of our knowledge, we are the first to exploit layer variety from both the layer type set and the layer order for pre-trained language models.
Neural Architecture Search Manually designing neural architectures is a time-consuming and error-prone process (Elsken et al., 2019). To address this, many neural architecture search algorithms have been proposed. Pioneering works utilize reinforcement learning (Zoph and Le, 2017; Baker et al., 2017) or evolutionary algorithms (Real et al., 2017) to sample architecture candidates and train them from scratch, which demands huge computation that ordinary researchers cannot afford. To reduce computation cost, recent methods (Pham et al., 2018; Xie et al., 2018; Brock et al., 2018; Cai et al., 2018; Bender et al., 2018; Wu et al., 2019a; Guo et al., 2020) adopt a weight sharing strategy in which a supernet subsuming all architectures is trained only once and all architecture candidates inherit their weights from it. Despite the boom of NAS research, most works focus on computer vision tasks (Chen et al., 2019; Ghiasi et al., 2019; Liu et al., 2019a), while NAS for NLP has not been fully investigated. Recently, So et al. (2019), among others, search architectures of transformers for translation tasks. Other works leverage differentiable neural architecture search to automatically compress BERT with task-oriented knowledge distillation for specific tasks, and Zhu et al. (2020) utilize architecture search to improve models based on pre-trained BERT for the relation classification task. However, these methods only focus on specific tasks or the fine-tuning phase. Besides, Khetan and Karnin (2020) employ pre-training loss to help prune BERT, but their method cannot find new architectures. Different from them, our work is the first to use NAS to explore new architectures in a pre-training scenario for general language understanding.

Method
An overview of our approach is shown in Figure 2. We first define the layer variety, which induces a large architecture search space; we then pre-train a supernet subsuming all candidate architectures, followed by an evolutionary algorithm guided by pre-training MLM (Devlin et al., 2019) accuracy to search for an effective model. In what follows, we give detailed descriptions.

Layer Variety
As shown in Figure 1(a), the proposed layer variety covers two aspects, layer type and layer order, both of which are important for the performance of pre-trained models but have not been exploited before.
Layer Type The layer type set of current BERT-like models consists of self-attention for information communication and feed-forward for non-linear transformation. However, as a global operator, self-attention needs to take all tokens as input to compute attention weights for each token, which is inefficient in capturing local information (Wu et al., 2019b). We notice that convolution (LeCun et al., 1998; Krizhevsky et al., 2012), as a local operator, has been successfully applied in language models (Zeng et al., 2014; Kim, 2014; Kalchbrenner et al., 2014; Wu et al., 2018, 2019b). A typical example is the dynamic convolution (Wu et al., 2018) for machine translation, language modeling and summarization. Therefore, we augment the layer type set by introducing dynamic convolution as a new layer type. The layer set considered in this work is thus L_type = {SA, FF, DC}, where the elements denote self-attention, feed-forward and dynamic convolution layers respectively. See Appendix for more detailed formulations.
Layer Order The other variety aspect is layer order. The most widely-used order for pre-trained models is the interleaved order (Vaswani et al., 2017; Devlin et al., 2019). For a model with 24 layers, the interleaved order can be expressed by the list (SA, FF, SA, FF, ..., SA, FF). Similarly, the sandwich order (Press et al., 2020) can be expressed as (SA, ..., SA, SA, FF, ..., SA, FF, FF, ..., FF), i.e., consecutive self-attention layers at the bottom and consecutive feed-forward layers at the top, with the interleaved pattern kept in the middle.
Figure 2: Overview of our method. (1) The supernet is pre-trained with MLM (Devlin et al., 2019) by uniformly sampling one type of layer at each layer position per training step. (2) The evolutionary algorithm produces candidate models. (3) The candidate models inherit their weights from the supernet. (4) The candidate models with inherited weights are directly evaluated with pre-training MLM accuracy on the validation set. (5) The accuracy is used to guide the evolutionary algorithm in generating new candidate models. (6) After T iterations, the candidate with the best pre-training accuracy is output as LV-BERT-small, which can be scaled up to LV-BERT-medium/base with a larger hidden size.
Beyond the above manually designed orders, we take advantage of neural architecture search to identify more effective layer orders for pre-trained models. The order to be discovered can be expressed as (L_1, L_2, ..., L_N), where L_i ∈ L_type and N is the number of layers.
Here, N is set to 24, following common practice.
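To make the order notation concrete, here is a small sketch (our illustration, not the paper's code) that builds the interleaved and sandwich orders for N = 24 as strings over the layer type set, and computes the size of the resulting search space:

```python
# Layer types abbreviated as characters: s = self-attention,
# f = feed-forward, c = dynamic convolution.
N = 24

# Interleaved order (Vaswani et al., 2017): (s, f, s, f, ..., s, f).
interleaved = "sf" * (N // 2)

# Sandwich order (Press et al., 2020) with sandwich coefficient k:
# k self-attention layers at the bottom, k feed-forward layers at the
# top, and the interleaved pattern in the middle.
def sandwich(n, k):
    middle = "sf" * ((n - 2 * k) // 2)
    return "s" * k + middle + "f" * k

# Search space explored in this paper: 3 choices per layer position.
space_size = 3 ** N  # ≈ 2.8 × 10^11 candidates
```

With k = 6, for instance, `sandwich(24, 6)` places six self-attention layers at the bottom and six feed-forward layers at the top, matching the description above.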

Supernet
The layer variety introduced above leads to a huge architecture space of 3^24 ≈ 2.8 × 10^11 candidate models. It is not affordable to pre-train every candidate model in this space from scratch to evaluate its performance, since the pre-training procedure requires huge computation. To reduce the search cost, recent NAS works (Pham et al., 2018; Guo et al., 2020; Cai et al., 2019) exploit a weight sharing strategy: a supernet subsuming all candidate architectures is trained first, and each candidate architecture then inherits its weights from the trained supernet to avoid training from scratch. Inspired by this strategy, we construct a supernet in which each layer contains all types of layers, i.e., self-attention, feed-forward, and dynamic convolution; its architecture can be expressed as (L_type, L_type, ..., L_type), i.e., the full layer type set at every one of the N positions. Masked Language Modeling (MLM) (Devlin et al., 2019) is used as the objective to pre-train the supernet, since MLM accuracy can reflect model performance on downstream tasks (Lan et al., 2020). Most weight sharing approaches in NAS (Wu et al., 2019a) train and optimize the full supernet: the output of each layer is the weighted sum of all types of candidate layers. However, this cannot guarantee that a sampled single type of layer also works well (Guo et al., 2020).
To handle this issue, we propose to randomly sample a submodel from the supernet to participate in forward and backward propagation at each training step (Cai et al., 2018; Guo et al., 2020). The sampled submodel architecture can be expressed as (L_1, L_2, ..., L_N), where each L_i ∈ L_type is drawn from the uniform distribution U with probability Pr = 1/3. With this pre-training method, the optimized supernet weights can be expressed as

W_A = argmin_W E_{a ~ U(A)} [L_pre-train(N(a, W(a)))],

where W(a) denotes the submodel weights inherited from the supernet, N(a, W(a)) denotes the submodel with the specific architecture and weights, L_pre-train denotes the pre-training MLM loss, and a ~ U(A) means that architecture a is uniformly sampled from the space A.

Algorithm 1: Evolutionary Search Guided by Pre-training MLM Accuracy
Input: W_A: supernet weights; P: population size; D_val: pre-training validation set; T: number of iterations; N_cro: number of crossovers; N_mut: number of mutations; p: mutation probability; k: number of top candidates for crossover and mutation
Output: a*: the architecture with the best pre-training MLM validation accuracy
S_0 := Init(P)    // Randomly generate P architecture candidates
S_topk := ∅    // The set of top-k candidates
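The single-path sampling strategy can be sketched as follows (our illustration; all names are ours, and the actual forward/backward pass is omitted). At each step, one layer type is drawn uniformly (Pr = 1/3) for every position, and only that sub-model would participate in training:

```python
import random

# Candidate layer types at every supernet position.
LAYER_TYPES = ["self_attention", "feed_forward", "dynamic_conv"]
N_LAYERS = 24

def sample_submodel(rng):
    """Uniformly sample one candidate architecture from the supernet."""
    return [rng.choice(LAYER_TYPES) for _ in range(N_LAYERS)]

def pretraining_step(supernet_weights, batch, rng):
    """One MLM step: only the sampled path takes part (sketch only).

    In a real implementation, the sub-model inheriting the relevant
    slice of supernet_weights would run forward/backward on the batch.
    """
    arch = sample_submodel(rng)
    return arch

rng = random.Random(0)
arch = pretraining_step({}, None, rng)
```

Over many steps, every position sees all three layer types, so any candidate architecture can later inherit meaningful weights from the supernet.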

Evolutionary Search
Inspired by recent NAS works (Elsken et al., 2019; Ren et al., 2020; Guo et al., 2020), we adopt an evolutionary algorithm (EA) to search for the model. Previously, Real et al. (2017) utilized an evolutionary method in NAS, but they trained each candidate model from scratch, which is costly and inefficient. Instead, thanks to the supernet mentioned above, we do not need to train the candidate models from scratch since their weights can be inherited from the supernet. The next problem is how to select an indicator for the candidate models to guide the EA. Note that our goal is to search for a general pre-trained model that benefits a variety of downstream tasks rather than a specific task.
Traditional NAS methods (Zhu et al., 2020) use downstream task performance as the objective to search for task-specific models. Instead, similar to the work of Khetan and Karnin (2020), which utilizes pre-training loss to prune BERT, our method uses pre-training MLM accuracy to search for a unified architecture that can generalize well to different downstream tasks. Moreover, with this accuracy, candidate models can be directly evaluated on the pre-training validation set without any fine-tuning on specific tasks, which helps save computation.
The detailed algorithm is given in Algorithm 1. Crossover(S_topk, N_cro) is the procedure that generates N_cro new candidate architectures: two candidate architectures randomly selected from the top-k candidate set S_topk are crossed to produce a new one. Similarly, Mutation(S_topk, N_mut, p) is the procedure that generates N_mut new candidates: a random candidate from S_topk mutates each of its layer choices with probability p to generate a new one. Finally, the candidate architecture with the highest pre-training validation accuracy in S_topk is returned as LV-BERT. The algorithm is configured with population size P = 50, search iteration number T = 20, crossover number N_cro = 25, mutation number N_mut = 25, mutation probability p = 0.1, and top candidate number k = 10 for crossover and mutation.

Pre-training Dataset BERT (Devlin et al., 2019) is pre-trained on English Wikipedia and BooksCorpus (Zhu et al., 2015). However, BooksCorpus is no longer publicly available. To ease reproduction, we train models on OpenWebText (Gokaslan and Cohen, 2019), which is open-sourced and of similar size to the corpus used by BERT. When pre-training the supernet, we leave out 2% of the data as our validation set for evolutionary search.
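The evolutionary search loop of Algorithm 1 can be sketched as follows. This is a toy illustration with the paper's hyperparameters: the fitness function here is a stand-in for pre-training MLM validation accuracy, which in the paper is computed with supernet-inherited weights:

```python
import random

TYPES = "sfc"          # self-attention, feed-forward, dynamic convolution
N = 24                 # number of layers
P, T = 50, 20          # population size, search iterations
N_CRO, N_MUT = 25, 25  # crossovers / mutations per iteration
MUT_P, TOPK = 0.1, 10  # mutation probability, top-k for crossover/mutation

def fitness(arch):
    # Stand-in objective: in the paper this is MLM validation accuracy
    # of the candidate with weights inherited from the supernet.
    return sum(1 for a, b in zip(arch, "sf" * (N // 2)) if a == b)

def crossover(a, b, rng):
    cut = rng.randrange(1, N)
    return a[:cut] + b[cut:]

def mutate(a, rng):
    return "".join(rng.choice(TYPES) if rng.random() < MUT_P else x
                   for x in a)

def evolve(rng):
    pop = ["".join(rng.choice(TYPES) for _ in range(N)) for _ in range(P)]
    for _ in range(T):
        topk = sorted(pop, key=fitness, reverse=True)[:TOPK]
        cros = [crossover(rng.choice(topk), rng.choice(topk), rng)
                for _ in range(N_CRO)]
        muts = [mutate(rng.choice(topk), rng) for _ in range(N_MUT)]
        pop = topk + cros + muts
    return max(pop, key=fitness)

best = evolve(random.Random(0))
```

Because candidates inherit weights instead of being trained, each fitness evaluation is just a validation pass, which is what makes searching this space affordable.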
Fine-tuning Datasets To compare our model with other pre-trained models, we fine-tune LV-BERT on GLUE (Wang et al., 2018), which includes various tasks for general language understanding, and on SQuAD 1.1/2.0 (Rajpurkar et al., 2016, 2018) for question answering. See Appendix for more details of all tasks.

Table 1: Performance of the models with different layer types and orders on the GLUE development set. DC, SA and FF denote dynamic convolution, self-attention and feed-forward layers respectively. For each design of the layer type set, "Random" means the best order among five randomly generated ones, estimated by training models from scratch. "Randomly searched" and "EA searched" are both based on the supernet: "Randomly searched" denotes orders searched at random while "EA searched" denotes orders searched by the evolutionary algorithm. * denotes methods implemented by us for language pre-training. All models are pre-trained on OpenWebText for 1M steps with sequence length 128 using the ELECTRA (Clark et al., 2020) pre-training objective, except BERT-small, which uses the MLM objective.

Implementation Details
Model Size Similar to Devlin et al. (2019) and Clark et al. (2020), we define different model sizes, i.e., "small", "medium" and "base", with the same layer number of 24 but different hidden sizes of 256, 384, and 768, respectively. The detailed hyperparameters are shown in Appendix.
Pre-training Supernet To reduce training cost, we construct the supernet only in the small size. Since the layer number of the medium- and base-sized models is the same as that of the small-sized one, the obtained architecture of LV-BERT-small can be easily scaled up to medium and base sizes. We use Adam (Kingma and Ba, 2015) to pre-train the supernet with the MLM loss (Devlin et al., 2019), a learning rate of 2e-4, a batch size of 128, a max sequence length of 128 and 2 million pre-training steps. See Appendix for more details.
Evaluation Setup To compare with other pre-trained models, we pre-train the searched LV-BERT architecture for 1M steps from scratch on OpenWebText (Gokaslan and Cohen, 2019) using Replaced Token Detection (Clark et al., 2020), since it saves computation cost. We fine-tune LV-BERT on the GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016, 2018) downstream tasks with most hyperparameters the same as those of ELECTRA (Clark et al., 2020) for fair comparison. For GLUE tasks, the evaluation metrics are Matthews correlation for CoLA, Spearman correlation for STS, and accuracy for the other tasks, which are averaged to get the GLUE score. We use the evaluation metrics Exact-Match (EM) and F1 for SQuAD 1.1/2.0. Some of the fine-tuning datasets are small, so results may vary substantially across random seeds. Similar to ELECTRA (Clark et al., 2020), we therefore report the median of 10 fine-tuning runs from the same pre-trained model for each result. See Appendix for more evaluation details.

Ablation Study
Layer Variety Various models are constructed with different layer variety designs, and their results on the GLUE development set are shown in Table 1. For layer types, if only two layer types are provided, selecting self-attention and feed-forward yields the best result, always achieving a score higher than 80 under the different search methods. With only dynamic convolution and feed-forward, the performance drops dramatically to around 65. Surprisingly, without feed-forward, the layer set of dynamic convolution and self-attention can still achieve a relatively good score, near 80. When using all three layer types, we obtain the best score of 81.8, 1.4 higher than the strong baseline ELECTRA (80.4) and 0.6 higher than the model searched with only self-attention and feed-forward (81.2). This indicates that augmenting the layer type set with convolution to extract local information is effective for pre-trained models.
For layer orders, with the same layer types, the models with either EA-searched or randomly searched orders perform better than those with randomly sampled orders, reflecting the importance of investigating layer orders. For example, with the same layer types of self-attention and feed-forward, the EA-searched model obtains a score of 81.2, improving over BERT/ELECTRA by 6.1/0.8 and over Sandwich by 2.6.

Table 1 shows the results with different search methods. "Random" means that for each design of the layer type set, the order is the best among 5 randomly generated orders, estimated by training models from scratch. "Randomly searched" and "EA searched" are both supernet-based methods, in which the weights of candidate models are inherited from the supernet: "Randomly searched" produces candidate models at random for estimation while "EA searched" generates candidate models with the evolutionary algorithm guided by pre-training MLM accuracy. With the same layer types, EA-searched orders are generally better than randomly searched ones, which in turn are generally better than random ones. Figure 3 plots the pre-training MLM evaluation accuracy over search iterations for both random and evolutionary search. The accuracy of evolutionary search is clearly higher than that of random search, demonstrating its effectiveness.

LV-BERT Architecture
As shown in Table 1, LV-BERT achieves the best performance. Its searched architecture is listed in Table 7 in the Appendix. When running the evolutionary method with different seeds, we see that the resulting models prefer stacking dynamic convolutions at the bottom two layers for extracting local information and self-attention at the top layer to fuse global information. According to these observations, for ELECTRA-small, if we replace the bottom two layers with dynamic convolutions or the top layer with self-attention, the performance improves by 0.3 or 0.5 respectively on the GLUE development set. If we replace the bottom 8 layers with the manually designed 'ccsfccsf' ('c', 's' and 'f' denote dynamic convolution, self-attention and feed-forward layers, respectively) and at the same time replace the top 8 layers with the manually designed 'ssfsssfs', we observe a 0.7 performance improvement. These results show that it is helpful to stack dynamic convolution at the bottom and self-attention at the top.
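The hand-designed modification described above can be written out explicitly with layer orders as strings; this is our illustration, starting from the 24-layer interleaved baseline:

```python
# 'c' = dynamic convolution, 's' = self-attention, 'f' = feed-forward.
# Baseline: the 24-layer interleaved order used by BERT/ELECTRA.
interleaved = "sf" * 12

# Replace the bottom 8 layers with 'ccsfccsf' and the top 8 layers with
# 'ssfsssfs', keeping the middle 8 layers interleaved.
modified = "ccsfccsf" + interleaved[8:16] + "ssfsssfs"
```

The resulting order concentrates dynamic convolutions near the input and self-attention near the output, mirroring the pattern the evolutionary search converges to.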

Generalization to Larger Models
We only investigate layer variety and search models in the small-sized setting to save computation cost. It is interesting to know whether the searched models generalize to larger models with larger hidden sizes. The results are shown in Table 2. For the larger model sizes "medium" and "base", LV-BERTs still outperform the other baseline models, demonstrating good generalization in terms of model size.

Comparison with State-of-the-arts
We compare LV-BERT with state-of-the-art pre-trained models (Radford et al., 2018; Devlin et al., 2019; Clark et al., 2020; Sanh et al., 2019; Jiao et al., 2020) on the GLUE testing set and SQuAD 1.1/2.0 to show its advantages. Although more pre-training data/steps and larger model sizes can significantly help improve performance (Yang et al., 2019; Lan et al., 2020), due to computation resource limits we only pre-train our models in small/medium/base sizes for 1M steps on OpenWebText (Gokaslan and Cohen, 2019). We leave evaluating models with more pre-training data/steps and larger model sizes for future work. We also list some knowledge distillation methods for comparison. Note, however, that these methods rely on a pre-trained large teacher network and are thus orthogonal to LV-BERT and the other methods.

Table 3 presents the performance of LV-BERT and other pre-trained models on the GLUE testing set. LV-BERT outperforms the other pre-trained models of similar model size. Remarkably, LV-BERT-small/base achieve 79.8/85.1, 1.8/1.6 higher than the strong baselines ELECTRA-small/base. Even compared with the knowledge distillation based model MobileBERT, LV-BERT-medium still outperforms it by 0.3.

Since there is nearly no single-model submission on the SQuAD leaderboard 2, we only compare LV-BERT with other pre-trained models on the development sets. The results are shown in Table 4. We find that LV-BERT-small outperforms ELECTRA-small significantly, e.g., an F1 score of 73.7 versus 69.4 on SQuAD 2.0. However, when we generalize LV-BERT-small to base size, the gap between LV-BERT and ELECTRA at base size is narrower than at small size. One reason may be that LV-BERT-small is searched by our method while LV-BERT-base is only generalized from LV-BERT-small with a larger hidden size.

Conclusion
We are the first to exploit layer variety for improving pre-trained language models, from two aspects, i.e., layer types and layer orders. For layer types, we augment the layer type set by including convolution for local information extraction. For layer orders, beyond the stereotyped interleaved one, we explore more effective orders using an evolutionary search algorithm. Experimental results show that our obtained model LV-BERT outperforms BERT and its variants on various downstream tasks.

Acknowledgments
We would like to thank the anonymous reviewers for their insightful comments and suggestions. This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-100E/2019-035). Jiashi Feng was partially supported by MOE2017-T2-2-151, NUS ECRA FY17 P08 and CRP20-2017-0006. The authors also thank Quanhong Fu and Jian Liang for their help in improving the technical writing of this paper. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg). Weihao Yu would like to thank the TPU Research Cloud (TRC) program.

A Details about Layer Types
For a layer, assume its input is I ∈ R^{s×c} and output is O ∈ R^{s×c}, where s is the sequence length and c is the hidden size (channel dimension). For simplicity, c takes the same value for the input and output.

Self-Attention
The self-attention layer, also known as multi-head self-attention (Vaswani et al., 2017), transforms the input by three linear transformations into the key K, query Q and value V vectors respectively:

K = Reshape(I W_K), Q = Reshape(I W_Q), V = Reshape(I W_V),

where W_K, W_Q, W_V ∈ R^{c×c} and K, Q, V ∈ R^{h×s×d}, with h the number of heads and d the head dimension (c = hd).
The above K and Q are used to compute their similarity matrix M, which is then used to generate the new value V':

M = Softmax(Q K^T / √d), V' = Reshape(M V),

where M ∈ R^{h×s×s} and V' ∈ R^{s×c}. Finally, a linear transformation is used to exchange information between different heads, followed by a shortcut connection and layer normalization:

O = LayerNorm(I + V' W_Out),

where W_Out ∈ R^{c×c}.

Feed-Forward The feed-forward layer (Vaswani et al., 2017) includes two linear transformations with a non-linear activation, followed by a shortcut connection and layer normalization:

O = LayerNorm(I + GELU(I W_1) W_2),

where W_1 ∈ R^{c×rc} and W_2 ∈ R^{rc×c} with a ratio r, and GELU(·) denotes the Gaussian Error Linear Unit (Hendrycks and Gimpel, 2016).
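As a shape-level sanity check on the two layer types above, here is a minimal NumPy sketch (our simplification with random placeholder weights and an approximate GELU, not the training code):

```python
import numpy as np

# s: sequence length; h heads of dimension d; hidden size c = h*d;
# r: feed-forward expansion ratio.
rng = np.random.default_rng(0)
s, h, d, r = 8, 4, 16, 4
c = h * d

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(I):
    W_K, W_Q, W_V, W_O = (rng.normal(size=(c, c)) * 0.02 for _ in range(4))
    split = lambda X: X.reshape(s, h, d).transpose(1, 0, 2)   # (h, s, d)
    K, Q, V = split(I @ W_K), split(I @ W_Q), split(I @ W_V)
    M = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))        # (h, s, s)
    V_new = (M @ V).transpose(1, 0, 2).reshape(s, c)          # (s, c)
    return layer_norm(I + V_new @ W_O)

def feed_forward(I):
    W_1 = rng.normal(size=(c, r * c)) * 0.02
    W_2 = rng.normal(size=(r * c, c)) * 0.02
    # tanh approximation of GELU (Hendrycks and Gimpel, 2016)
    gelu = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi)
                                            * (x + 0.044715 * x ** 3)))
    return layer_norm(I + gelu(I @ W_1) @ W_2)

I = rng.normal(size=(s, c))
O = feed_forward(self_attention(I))
```

Both layers map (s, c) to (s, c), which is what lets the search freely reorder them.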

Dynamic Convolution
Different from the vanilla dynamic convolution that directly generates the dynamic kernel from V ∈ R^{s×c}, in this work we add a separable convolution (Howard et al., 2017) with depthwise weights W_Dep ∈ R^{k×c} (k is the convolution kernel size, set to 9 in this paper) and pointwise weights W_Poi ∈ R^{c×c} to extract local information that helps the subsequent kernel generation. Denoting its output as S ∈ R^{s×c}, the separable convolution can be formulated as

S = PointwiseConv(DepthwiseConv(I; W_Dep); W_Poi).

The output of the separable convolution is then used to generate the dynamic kernels:

D = Softmax(Reshape(S W_Dyn)),

where W_Dyn ∈ R^{c×hk} and D ∈ R^{h×s×k}, with the softmax normalizing over the kernel dimension k. Then lightweight convolution is applied to the reshaped value V' = Reshape(V) ∈ R^{h×s×d}:

C = LightConv(V'; D),

where the output C ∈ R^{h×s×d} convolves each position with its own normalized kernel, shared across the d channels of a head. Finally, C is reshaped back to C' = Reshape(C) ∈ R^{s×c} and a linear transformation is applied to fuse the information among the multiple heads, followed by a shortcut connection and layer normalization:

O = LayerNorm(I + C' W_Out + b_Out),

where W_Out ∈ R^{c×c} and b_Out ∈ R^c.
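A rough NumPy sketch of this pipeline, only to make the tensor shapes concrete (our illustration; weights are random placeholders, padding handling is simplified, and the residual/linear output step is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
s, h, d, k = 8, 4, 16, 9   # sequence, heads, head dim, kernel size
c = h * d

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

I = rng.normal(size=(s, c))
W_Dep = rng.normal(size=(k, c)) * 0.1   # depthwise weights
W_Poi = rng.normal(size=(c, c)) * 0.1   # pointwise weights
W_Dyn = rng.normal(size=(c, h * k)) * 0.1

# Separable convolution producing S (local context per position).
pad = k // 2
I_pad = np.pad(I, ((pad, pad), (0, 0)))
S = np.stack([(I_pad[i:i + k] * W_Dep).sum(0) for i in range(s)]) @ W_Poi

# Position-specific kernels D in R^{h×s×k}, normalized over k.
D = softmax((S @ W_Dyn).reshape(s, h, k)).transpose(1, 0, 2)

# Lightweight convolution: each head's kernel is shared by its d channels.
V = rng.normal(size=(s, c)).reshape(s, h, d).transpose(1, 0, 2)  # (h, s, d)
V_pad = np.pad(V, ((0, 0), (pad, pad), (0, 0)))
C = np.stack([np.einsum("hk,hkd->hd", D[:, i], V_pad[:, i:i + k])
              for i in range(s)], axis=1)                        # (h, s, d)
out = C.transpose(1, 0, 2).reshape(s, c)
```

Unlike self-attention, every position only looks at a window of k = 9 neighbors, which is exactly the local bias the paper adds to the layer set.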

B.1 GLUE Dataset
Introduced by Wang et al. (2018), General Language Understanding Evaluation (GLUE) benchmark is a collection of nine tasks for natural language understanding, where testing set labels are hidden and predictions need to be submitted to the evaluation server 3 . We provide details about the GLUE tasks below.
CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a binary single-sentence classification dataset for predicting whether a sentence is grammatical or not. The samples are from books and journal articles on linguistic theory.

MRPC The Microsoft Research Paraphrase Corpus (Dolan and Brockett, 2005) is a dataset for the task to predict whether two sentences are semantically equivalent or not. It is extracted from online news sources with human annotations.
MNLI The Multi-Genre Natural Language Inference Corpus (Williams et al., 2018) is a dataset of sentence pairs. Each pair has a premise sentence and a hypothesis sentence, requiring models to predict whether the relationship is entailment, contradiction or neutral. It is drawn from ten distinct genres of spoken and written English.
SST The Stanford Sentiment Treebank (Socher et al., 2013) is a dataset for the task to predict whether a sentence is positive or negative in sentiment. The dataset is from movie reviews with human annotations.

RTE The Recognizing Textual Entailment (RTE) dataset is for the task to determine whether the relationship of a pair of premise and hypothesis sentences is entailment. The dataset is from several annual textual entailment challenges, including RTE1 (Dagan et al., 2005), RTE2 (Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009).
QNLI Question Natural Language Inference is a dataset converted from The Stanford Question Answering Dataset (Rajpurkar et al., 2016). An example is a pair of a context sentence and a question, requiring to predict whether the context sentence contains the answer to the given question.
QQP The Quora Question Pairs dataset (Chen et al., 2018) is collected from Quora, requiring to determine whether a pair of questions are semantically equivalent or not.

Table 7: Architectures of different models and their performance on the GLUE development set. In the Architecture column, 0, 1, and 2 denote dynamic convolution, self-attention, and feed-forward layers respectively. * denotes methods implemented by us for language pre-training.
STS The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs with human-annotated similarity scores on a 1-5 scale.
WNLI Winograd NLI (Levesque et al., 2012) is a small dataset for natural language inference. However, there are issues with the construction of this dataset 4. Therefore, this dataset is excluded from comparison in this paper, following BERT (Devlin et al., 2019) and other works.

4 https://gluebenchmark.com/faq

B.1.1 SQuAD dataset
The Stanford Question Answering Dataset (SQuAD 1.1) (Rajpurkar et al., 2016) is a dataset of more than 100K questions, all of which can be answered by locating a span of text in the corresponding context passage. Building on this data, the upgraded version SQuAD 2.0 (Rajpurkar et al., 2018) supplements it with over 50K unanswerable questions.

C Pre-training Details
For the supernet, we pre-train for 2M steps with the hyperparameters listed in Table 5, using the Masked Language Modeling (MLM) pre-training objective (Devlin et al., 2019). This objective masks 15% of input tokens, which the model is required to predict. The reason to use this objective is that the MLM validation accuracy can reflect the performance of models on downstream tasks (Lan et al., 2020). For pre-training LV-BERTs and other compared baselines like DynamicConv (Wu et al., 2018) and Sandwich (Press et al., 2020) from scratch, we utilize the Replaced Token Detection (RTD) pre-training objective (Clark et al., 2020). This objective employs a small generator to predict masked tokens and utilizes a larger discriminator to determine whether the tokens predicted by the generator are the same as the original ones. RTD saves computation cost while achieving good performance (Clark et al., 2020). We pre-train the models for 1M steps, mostly using the same hyperparameters as ELECTRA (Clark et al., 2020). We set the pre-training sequence length to 128, which helps save computation cost. For the downstream tasks SQuAD 1.1/2.0 that need a longer input sequence, we pre-train for 10% more steps with a sequence length of 512 to learn the position embeddings before fine-tuning. The hyperparameters are listed in Table 5.
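The MLM input corruption can be sketched as follows (a deliberate simplification: we mask a flat 15% of positions with a [MASK] token; BERT's additional 80/10/10 replacement scheme is omitted):

```python
import random

def mask_tokens(tokens, rng, mask_token="[MASK]", mask_prob=0.15):
    """Mask ~15% of positions; the model must predict the originals."""
    n_mask = max(1, int(round(len(tokens) * mask_prob)))
    positions = rng.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = mask_token
    return corrupted, sorted(positions)

rng = random.Random(0)
tokens = ["the", "model", "predicts", "masked", "input", "tokens",
          "during", "pre", "training", "steps", "on", "text",
          "data", "from", "open", "web", "sources", "here", "now", "ok"]
corrupted, positions = mask_tokens(tokens, rng)
```

The MLM accuracy used to guide the evolutionary search is simply the fraction of these masked positions that a candidate model recovers correctly on the held-out validation split.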

D Fine-tuning Details
For fine-tuning on downstream tasks, most of the hyperparameters are the same as ELECTRA (Clark et al., 2020). See Table 6.

E Searched Architectures
The different searched architectures are listed in Table 7.