Training-free Neural Architecture Search for RNNs and Transformers

Neural architecture search (NAS) has allowed for the automatic creation of new and effective neural network architectures, offering an alternative to the laborious process of manually designing complex architectures. However, traditional NAS algorithms are slow and require immense amounts of computing power. Recent research has investigated training-free NAS metrics for image classification architectures, drastically speeding up search algorithms. In this paper, we investigate training-free NAS metrics for recurrent neural network (RNN) and BERT-based transformer architectures, targeted towards language modeling tasks. First, we develop a new training-free metric, named hidden covariance, that predicts the trained performance of an RNN architecture and significantly outperforms existing training-free metrics. We experimentally evaluate the effectiveness of the hidden covariance metric on the NAS-Bench-NLP benchmark. Second, we find that the current search space paradigm for transformer architectures is not optimized for training-free neural architecture search. Instead, a simple qualitative analysis can effectively shrink the search space to the best performing architectures. This conclusion is based on our investigation of existing training-free metrics and new metrics developed from recent transformer pruning literature, evaluated on our own benchmark of trained BERT architectures. Ultimately, our analysis shows that the architecture search space and the training-free metric must be developed together in order to achieve effective results. Our source code is available at https://github.com/aaronserianni/training-free-nas.


Introduction
Recurrent neural networks (RNNs) and BERT-based models with self-attention have been extraordinarily successful in achieving state-of-the-art results on a wide variety of language modeling-based natural language processing (NLP) tasks, including question answering, sentence classification, tagging, and natural language inference (Brown et al. 2020; Palangi et al. 2016; Raffel et al. 2020; Sundermeyer, Schlüter, and Ney 2012; Yu et al. 2019). However, the manual development of new neural network architectures has become increasingly difficult as models grow larger and more complicated. Neural architecture search (NAS) algorithms aim to procedurally design and evaluate new, efficient, and effective architectures within a predesignated search space (Zoph and Le 2017). NAS algorithms have been extensively used for developing new convolutional neural network (CNN) architectures for image classification, with many surpassing manually-designed architectures and achieving SOTA results on many classification benchmarks (Tan and Le 2019; Real et al. 2019).
While NAS algorithms and methods have been successful in developing novel and effective architectures, current algorithms face two main problems: the search space for candidate architectures is immense, and the amount of time and computational power required to run NAS algorithms is prohibitively expensive (Mehta et al. 2022). Because traditional NAS algorithms gauge performance by evaluating candidate architectures, each candidate must be trained fully, taking hours or days to complete. Thus, past attempts at NAS have been critiqued for being computationally resource-intensive, consuming immense amounts of electricity, and producing large amounts of carbon emissions (Strubell, Ganesh, and McCallum 2019). These problems are particularly acute for transformers and RNNs, as they have more parameters and take longer to train than other types of neural networks (So, Le, and Liang 2019; Zhou et al. 2022).
Recently, there has been research into training-free NAS metrics and algorithms, which offer significant performance increases over traditional NAS algorithms (Abdelfattah et al. 2020; Mellor et al. 2021a; Zhou et al. 2022). These metrics aim to partially predict an architecture's trained accuracy from its initial untrained state, given a subset of inputs. However, prior research has focused on developing training-free NAS metrics for CNNs and Vision Transformers on image classification tasks. In this work, we apply existing training-free metrics to, and create new ones for, RNNs and BERT-based transformers on language modeling tasks. Our main contributions are:

- Hidden covariance, a new training-free metric that predicts the trained performance of an RNN architecture and significantly outperforms existing training-free metrics on the NAS-Bench-NLP benchmark.
- A new NAS benchmark of 500 trained BERT-based architectures sampled from the FlexiBERT search space, pretrained with the ELECTRA scheme.
- An analysis of existing training-free metrics and new metrics adapted from transformer pruning literature, showing that the current search space paradigm for transformers is not optimized for training-free NAS, and that the search space and the training-free metric must be developed together.

Related Work
Since the development and adoption of neural architecture search, there has been research into identifying well-performing architectures without the costly task of training candidate architectures.

NAS Performance Predictors
Prior attempts at predicting a network architecture's accuracy focused on training a separate performance predictor. Deng et al. (2017) and Istrate et al. (2019) developed methods called Peephole and TAPAS, respectively, to embed the layers of an untrained CNN architecture into vector representations of fixed dimension. Both methods then trained LSTM networks on these vector representations to predict the trained architecture's accuracy, and both achieved strong linear correlations between the LSTMs' predicted accuracy and the actual trained accuracy of the CNN architectures. In addition, the LSTM predictors can quickly evaluate large numbers of CNN architectures. The primary limitation of these methods is that the LSTM predictors require large numbers of trained CNN architectures in order to be trained accurately, thus not achieving the goal of training-free NAS. Abdelfattah et al. (2020) introduced a series of additional training-free metrics for CNNs on image classification tasks, based in network pruning literature, aiming to improve performance. They also tested their metrics on other search spaces with different tasks, including NAS-Bench-NLP with RNNs and NAS-Bench-ASR, but found significantly reduced performance in these search spaces.

Training-free NAS Metrics
A series of training-free NAS metrics have been proposed in recent literature. These metrics look at specific aspects of an architecture, such as parameter gradients, activation correlations, and weight matrix rank. Most metrics can be generalized to any type of neural network, but have only been tested on CNN architectures. For transformer architectures, we also adapt various attention parameter pruning metrics as training-free metrics, scoring the entire network.

Synaptic Saliency
In the area of network pruning, Tanaka et al. (2020) proposed synaptic saliency, a score approximating the change in loss when a specific parameter is removed. Synaptic saliency is based on the idea of preventing layer collapse while pruning a network, which significantly decreases the network's accuracy. Synaptic saliency is expressed by

S(θ) = (∂L/∂θ) ⊙ θ,

where L is the loss function of the network, θ is the network's parameters, and ⊙ is the Hadamard product. Abdelfattah et al. (2020) generalize synaptic saliency as a training-free metric for NAS by summing over all N parameters in the network: S = Σ_{i=1}^{N} S(θ_i). Abdelfattah et al. (2020) found that synaptic saliency slightly outperforms Jacobian covariance on NAS-Bench-201.
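To make this concrete, the summed score needs only one forward and backward pass on a single minibatch. Below is a minimal PyTorch sketch; the function name is illustrative, and taking the absolute value of each parameter's contribution follows common implementations of this metric rather than anything specified above:

```python
import torch

def synaptic_saliency_score(model, loss):
    """Sum (dL/dtheta) ⊙ theta over all parameters of an initialized network.

    `loss` is the loss computed on a single sample minibatch; one backward
    pass populates the gradients the score reads.
    """
    loss.backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            # Hadamard product of gradient and parameter, reduced to a scalar.
            # The absolute value is an implementation assumption.
            score += (p.grad * p).abs().sum().item()
    return score
```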

Jacobian Covariance
Jacobian Covariance is a training-free NAS metric for CNN networks proposed by Mellor et al. (2021b). Given a minibatch of input data, the metric assesses the Jacobian of the network's loss function with respect to the minibatch inputs, J = (∂L/∂x_1, ..., ∂L/∂x_N); further details of the metric can be found in the original paper. Celotti, Balafrej, and Calvet (2020) expand on Jacobian Covariance with a series of variations, aiming to speed up computation and refine the metric's effectiveness. These include scoring with the cosine similarity between the normalized Jacobians J_n of the N inputs in the minibatch, instead of with a covariance matrix. They also add various noise levels to the input minibatch, hypothesizing that an architecture with high accuracy will be robust against noise.
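As a sketch of how the covariance variant can be computed (assuming continuous inputs; for token inputs the gradients would instead be taken with respect to the embeddings), the score below follows Mellor et al.'s eigenvalue-based formulation, with illustrative names:

```python
import torch

def jacobian_covariance_score(model, inputs, targets, loss_fn, k=1e-5):
    """Score the correlations between per-input gradients of the loss."""
    inputs = inputs.clone().detach().requires_grad_(True)
    loss_fn(model(inputs), targets).backward()
    J = inputs.grad.flatten(start_dim=1)   # one gradient row per input
    J = J - J.mean(dim=0, keepdim=True)    # center over the minibatch
    C = J @ J.t()                          # N x N covariance matrix
    d = torch.sqrt(torch.diag(C))
    R = C / (d[:, None] * d[None, :])      # correlation matrix
    lam = torch.linalg.eigvalsh(R)         # eigenvalues of the kernel
    return -torch.sum(torch.log(lam + k) + 1.0 / (lam + k)).item()
```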

Activation Distance
In a revised version of their paper, Mellor et al. (2021a) developed a metric that directly looks at the ReLU activations of a network. Given a minibatch of inputs fed into the network, the metric calculates the similarity of the activations within the initialized network between each input using the Hamming distance. Mellor et al. conclude that the more similar the activation maps for a given set of inputs are to each other, the harder it is for the network to disentangle the representations of the inputs during training.

Synaptic Diversity
Zhou et al. (2022) developed a metric specific to vision transformers (ViT) (Dosovitskiy et al. 2021). Synaptic diversity is based upon previous research on rank collapse in transformers, where for a set of inputs the output of a multi-headed attention block converges to rank 1, significantly harming the performance of the transformer. Zhou et al. use the nuclear norm of an attention head's weight matrix W_m as an approximation of its rank, creating a synaptic diversity score: S_D = Σ_m ||∂L/∂W_m||_nuc ⊙ ||W_m||_nuc.

Hidden Covariance
We propose a new metric specific to RNNs, based on the hidden states between each layer of the RNN architecture.
Previous NAS metrics focus on either the activation functions within an architecture or all parameters of the architecture. The hidden state of an RNN layer encodes all of the information of the input before being passed to the next layer or the final output. Similar to Mellor et al. (2021a), we hypothesize that the more similar the hidden states of an architecture are to each other for a given minibatch of inputs, the more difficult the architecture is to train.
Given the hidden state H(X) of a specific layer of the RNN for a minibatch of N inputs X = {x_n}_{n=1}^{N}, we compute the covariance matrix

C = (H − M_H)(H − M_H)^T,

where M_H is the matrix of the means of each row of H, and then normalize C into the correlation matrix R with entries R_ij = C_ij / √(C_ii C_jj). As with Mellor et al.'s Jacobian Covariance score, the final metric is calculated with the Kullback-Leibler divergence of the kernel of R, which has the eigenvalues λ_1, ..., λ_N:

S = −Σ_{n=1}^{N} [log(λ_n + k) + (λ_n + k)^{−1}],

where k = 10^{−5}.
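A minimal sketch of the score, assuming the hidden states for a minibatch have already been collected into an N × d matrix (for example, with a forward hook on the layer); the function name is illustrative:

```python
import torch

def hidden_covariance_score(hidden, k=1e-5):
    """Hidden covariance score for one RNN layer.

    `hidden` is an (N, d) matrix holding one hidden-state vector per input
    in the minibatch, taken from the initialized, untrained network.
    """
    H = hidden - hidden.mean(dim=0, keepdim=True)  # center the states
    C = H @ H.t()                                  # N x N covariance matrix
    d = torch.sqrt(torch.diag(C))
    R = C / (d[:, None] * d[None, :])              # correlation matrix
    lam = torch.linalg.eigvalsh(R)                 # eigenvalues of the kernel
    return -torch.sum(torch.log(lam + k) + 1.0 / (lam + k)).item()
```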

Attention Confidence, Importance, and Softmax Confidence
For transformer-specific metrics, we look to current transformer pruning literature. Voita et al. (2019) propose pruning the attention heads of a trained transformer encoder block by computing the "confidence" of a head using a sample minibatch of input tokens. Confident heads attend their output highly to a single token and, hypothetically, are more important to the transformer's task. Behnke and Heafield (2020) attempt to improve on attention confidence by looking at the probability distribution provided by an attention head's softmax layer. Alternatively, Michel, Levy, and Neubig (2019) look at the sensitivity of an attention head to its weights being masked, by computing the product between the output of an attention head and the gradient of its weights. These three attention metrics are summarized by:

Confidence: A_h(X) = (1/N) Σ_{n=1}^{N} |max Att_h(x_n)|
Softmax Confidence: A_h(X) = (1/N) Σ_{n=1}^{N} |max softmax_h(x_n)|
Importance: A_h(X) = |Att_h(X) · (∂L(X)/∂Att_h(X))|

where X = {x_n}_{n=1}^{N} is a minibatch of N inputs, L is the loss function of the model, and Att_h and softmax_h are an attention head and its softmax respectively. We expand these metrics into an overall score for the entire network by summing over all attention heads: A(X) = Σ_h A_h(X).
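As a sketch of the confidence score (assuming the per-head attention weight matrices for a minibatch have been collected, e.g., via forward hooks on the untrained model; the names and the reduction over sequence positions are illustrative assumptions):

```python
import torch

def head_confidence(attn_weights):
    """Confidence of one head: mean absolute maximum attention weight.

    `attn_weights` is an (N, seq_len, seq_len) tensor of softmaxed
    attention weights for one head over a minibatch of N inputs.
    """
    return attn_weights.amax(dim=-1).abs().mean().item()

def network_confidence(per_head_weights):
    """Overall network score: sum of per-head confidences."""
    return sum(head_confidence(a) for a in per_head_weights)
```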

NAS Benchmarks
Because of the large search space for neural architectures, it is challenging to directly compare various NAS algorithms. A series of NAS benchmarks (Mehta et al. 2022) have been created, which evaluate a set of architectures within a given search space and store the trained metrics in a lookup table. These benchmarks include NAS-Bench-101 (Ying et al. 2019), NAS-Bench-201 (Dong and Yang 2020), and NAS-Bench-301 (Siems et al. 2021) with CNNs for image classification, NAS-Bench-ASR with convolutional LSTMs for automatic speech recognition (Mehrotra et al. 2021), and NAS-Bench-NLP with RNNs for language modeling tasks (Klyuchnikov et al. 2022). Because all the architectures in a NAS benchmark have already been trained, they also allow for easier development of NAS algorithms without the large amounts of computational power required to train thousands of architectures. However, there are currently no NAS benchmarks for transformer or BERT-based architectures, likely due to the longer time and higher computational power needed to train transformers.
To evaluate training-free metrics on RNNs, we utilize the NAS-Bench-NLP benchmark (Klyuchnikov et al. 2022), which consists of 14,322 RNN architectures trained for language modeling with the Penn Tree Bank dataset. The architecture search space is defined by the operations within an RNN cell, connected in the form of an acyclic digraph. The RNN architecture consists of three identical stacked cells with an input embedding and a connected output layer. In our evaluations, the NAS-Bench-NLP architectures which did not complete training in the benchmark or whose metrics could not be calculated were discarded, leaving 8,795 architectures.

BERT Benchmark for NAS
Because no preexisting NAS benchmark exists for BERT-based models, we need to pretrain and evaluate a large set of BERT architectures in order to evaluate our proposed training-free NAS metrics. Certain choices, described below, were made in order to speed up pretraining.

FlexiBERT Search Space BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2019) consists of a series of encoder layers with multi-headed self-attention, taken from the original transformer model proposed by Vaswani et al. (2017). Numerous variations on the original BERT model have been developed. For our architecture search space, we utilize the FlexiBERT search space (Tuli et al. 2022), which has improvements over other proposed BERT search spaces. Foremost is that the encoder layers in FlexiBERT are heterogeneous, each having their own set of architecture elements. FlexiBERT also incorporates alternatives to multi-headed self-attention into its search space. The search space is described in Table 1.
The architectures in the FlexiBERT search space are relatively small, as its hyperparameters span those found in BERT-Tiny and BERT-Mini (Turc et al. 2019). However, Kaplan et al. (2020) show that many attributes of a transformer architecture, including number of parameters, scale linearly with the architecture's performance. Thus, a transformer architecture can easily be scaled up by increasing its hyperparameter values to those found in larger architectures, in order to achieve greater performance. This methodology was utilized in the EcoNAS algorithm (Zhou et al. 2020), which explores a reduced search space before scaling up to produce the final model.
To allow for simpler implementation of the FlexiBERT search space and the utilization of absolute positional encoding, we keep the hidden dimension homogeneous across all encoder layers. In total, this search space encompasses 10,621,440 different transformer architectures.
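To illustrate the structure of the search space, the sketch below samples one heterogeneous architecture. The hyperparameter names and value ranges here are placeholders rather than the actual entries of Table 1; only the structure (a fixed hidden dimension with independent per-layer choices) reflects our setup:

```python
import random

# Placeholder ranges standing in for Table 1.
HIDDEN_DIMS = [128, 256]
NUM_LAYERS = [2, 4]
PER_LAYER_CHOICES = {
    "attention_type": ["self_attention", "linear_transform", "conv"],
    "num_heads": [2, 4],
    "feed_forward_dim": [512, 1024],
}

def sample_architecture(rng=random):
    """Sample one architecture: hidden dimension and layer count are fixed
    network-wide; all other elements vary per encoder layer."""
    n_layers = rng.choice(NUM_LAYERS)
    return {
        "hidden_dim": rng.choice(HIDDEN_DIMS),
        "layers": [
            {name: rng.choice(opts) for name, opts in PER_LAYER_CHOICES.items()}
            for _ in range(n_layers)
        ],
    }
```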
ELECTRA Pretraining Instead of the traditional masked language modeling used to pretrain BERT-based models, we implemented the ELECTRA pretraining scheme (Clark et al. 2020), which uses a combined generator-discriminator model with a replaced token detection task. As the ELECTRA task is defined over all input tokens instead of only the masked tokens, it is significantly more compute-efficient and results in better finetuning performance than masked language modeling. Notably, ELECTRA scales well with small amounts of compute, allowing for efficient pretraining of small BERT models.

Architecture Training and Evaluation
We pretrain a random sample of 500 models from the FlexiBERT subspace using ELECTRA with the OpenWebText dataset, consisting of 38 GB of tokenized text data from 8,013,769 documents (Gokaslan and Cohen 2019). OpenWebText is based on OpenAI's WebText dataset (Radford et al. 2019). Pretraining occurs on TPUv2s with 8 cores and 64 GB of memory, using Google Colaboratory. We finetune and evaluate the architectures on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2019). The hyperparameters used for pretraining and finetuning are the same as those used for ELECTRA-Small. However, the sampled architectures were only pretrained for 100,000 steps, the best tradeoff between pretraining time and GLUE score. All GLUE results are from the dev set.

Experimental Results of Training-free Metrics
For the training-free NAS metrics presented, we empirically evaluate how well each metric predicts the trained performance of an architecture. We use the Kendall rank correlation coefficient (Kendall τ) and the Spearman rank correlation coefficient (Spearman ρ) to quantitatively evaluate the metrics, comparing them with the trained performance of the architectures within NAS-Bench-NLP and our BERT benchmark.
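Concretely, each metric is scored by rank-correlating its values on the untrained architectures with their trained performance; a minimal sketch using SciPy (function name illustrative):

```python
from scipy import stats

def rank_correlations(metric_scores, trained_performance):
    """Kendall tau and Spearman rho between a training-free metric and
    the trained performance of the same architectures."""
    tau, _ = stats.kendalltau(metric_scores, trained_performance)
    rho, _ = stats.spearmanr(metric_scores, trained_performance)
    return tau, rho
```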

Training-free Metrics for RNNs
We ran the training-free metrics on 8,795 architectures in NAS-Bench-NLP. A summary of our results is shown in Figure 1. Computing these metrics was very efficient, requiring only a forward and backward pass with a single minibatch of sample data in order to compute one set of gradients. Furthermore, all the metrics can be computed simultaneously on the same input and gradients. Most metrics perform poorly at predicting the loss of a trained RNN architecture, including all the existing training-free metrics designed for CNN architectures; none surpassed a Kendall τ value of 0.28. Our proposed Hidden Covariance score performs the best of all metrics, achieving a Kendall τ value of 0.3715. It is clear that the initialized hidden states of an RNN contain the most salient information for predicting the RNN's trained accuracy.

Training-free Metrics for BERT Architectures
We investigated the series of training-free metrics on our own NAS BERT benchmark of 500 architectures sampled from the FlexiBERT search space. Results are shown in Figure 2. Compared to their performance on NAS-Bench-NLP, all the training-free metrics, including our proposed metrics based on attention head pruning, performed poorly. Only the Attention Confidence metric had a significant positive correlation, with a Kendall τ of 0.27.
A notable reference point for training-free metrics is the number of trainable parameters in a transformer architecture. Previous research has shown a strong correlation between number of parameters and model performance across a wide range of transformer sizes and hyperparameters (Kaplan et al. 2020). Our NAS BERT benchmark displays this same correlation (Figure 3). In fact, the Kendall τ value for number of parameters is 0.44, significantly surpassing all training-free metrics.
Great care must be taken when developing training-free metrics to ensure that the metric is normalized for number of parameters or other high-level features of the network, such as number of layers or hidden size. Zhou et al. did not normalize their proposed DSS-indicator score for vision transformers (a combination of the synaptic saliency and synaptic diversity metrics) for the number of features in the network. Instead, the DSS-indicator almost directly corresponds to the number of parameters in an architecture, as shown in their figures, thus yielding their high Kendall τ of 0.70. We witnessed a similar pattern with our series of metrics. Our highest performing score, Attention Confidence, had a Kendall τ of 0.49 without normalization for number of features, comparable to number of parameters, but this decreased to 0.30 with normalization (Figure 4).
Neural architecture search for transformers is a fundamentally different task than neural architecture search for CNNs and RNNs. Almost all search spaces for transformers rely on the same fundamental paradigm of an attention module followed by a feed-forward module within each encoder/decoder block, connected linearly (Wang et al. 2020; Yin et al. 2021; Zhao et al. 2021). Conversely, most search spaces for CNNs and RNNs, including NAS-Bench-201 and NAS-Bench-NLP, use a cell-based method, typically with an acyclic digraph representing the connections between operations (Dong and Yang 2020; Jing, Xu, and Zugeng 2020; Klyuchnikov et al. 2022; Tan et al. 2019), allowing for significantly more flexibility in cell variation. For CNN and RNN search spaces, the connections between operations within a cell have a greater impact on the architecture's performance than the number of parameters. In NAS-Bench-NLP, there is no correlation between number of parameters and model performance (Figure 5); hence, previous studies did not need to normalize their training-free metrics for number of parameters. Furthermore, we hypothesize that for transformer search spaces, the number of parameters in an architecture dominates model performance, explaining the poor performance of training-free NAS metrics.

With this hypothesis, we propose an alternative to training-free metrics for current transformer neural architecture search, based upon transformer scaling laws. When model size and number of parameters are not a concern, increasing the size and dimensions of the architecture will consistently increase model performance. Thus, one should limit the transformer search space to larger model sizes in order to find better performing models more quickly.
However, the number of parameters within a model must often be limited due to computational constraints, including training time and cost or deployment on resource-constrained devices. In this case, one should first set a targeted number of parameters for the architectures within the transformer search space. The ratios between various hyperparameters within the network can then be searched with the NAS algorithm. Kaplan et al. (2020) found that model performance varies minimally between different hyperparameter ratios when the number of parameters is fixed, for architectures that are homogeneous between layers. They also present optimal ratios between hidden size, feed-forward dimension, number of layers, and number of attention heads. Therefore, increased focus should be placed on layer heterogeneity within the search space, with established hyperparameter ratios used as starting points.
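For instance, a parameter budget can anchor the search via Kaplan et al.'s approximation that a standard transformer with feed-forward dimension 4·d_model has roughly N ≈ 12 · n_layer · d_model² non-embedding parameters; the sketch below (illustrative names, and an approximation borrowed from decoder-only models) enumerates architecture shapes near a budget, leaving per-layer heterogeneity to the search:

```python
def approx_non_embedding_params(n_layer, d_model):
    """Kaplan et al. (2020): N ≈ 12 * n_layer * d_model^2 for a standard
    transformer with d_ff = 4 * d_model and attention dim = d_model."""
    return 12 * n_layer * d_model ** 2

def candidate_shapes(target_params, d_models=(128, 256, 512)):
    """Yield (n_layer, d_model) pairs that roughly meet the budget."""
    for d_model in d_models:
        n_layer = max(1, round(target_params / (12 * d_model ** 2)))
        yield n_layer, d_model

# Example: shapes near a 4M non-embedding-parameter budget.
for n_layer, d_model in candidate_shapes(4_000_000):
    print(n_layer, d_model, approx_non_embedding_params(n_layer, d_model))
```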
While these suggestions can help shrink the search space for transformer architectures and speed up neural architecture search algorithms, they do not address the main problem in transformer architecture search: the inflexibility of current transformer search spaces. Unless transformer search spaces adopt the variability of connections provided by cell-based methods, as used by CNN and RNN search spaces, simple heuristics such as number of parameters will remain the primary training-free predictor of transformer model performance. To our knowledge, two works have utilized a cell-based method for transformer search spaces: the original transformer architecture search paper, "The Evolved Transformer," by So, Le, and Liang, and its successor, "Primer" (So et al. 2021). Some research has been done with cell-based search spaces for Conformers (Shi et al. 2021) and Vision Transformers (Guo et al. 2020), but only on the convolution modules of the architectures. Ultimately, there is significant opportunity for growth in transformer architecture search, and with it, training-free NAS metrics for transformers.

Conclusion
In this paper, we presented and evaluated a series of training-free NAS metrics for RNN and BERT-based transformer architectures trained on language modeling tasks. We developed new training-free metrics targeted at specific architectures: hidden covariance for RNNs and three metrics based on attention head pruning for transformers. We first verified the training-free metrics on NAS-Bench-NLP, and found that our hidden covariance metric outperforms existing training-free metrics on RNNs. We then developed our own NAS benchmark for transformers within the FlexiBERT search space, utilizing the ELECTRA scheme to significantly speed up pretraining. Evaluating the training-free metrics on our benchmark, our proposed Attention Confidence metric performs the best. However, the current search space paradigm for transformers is not well-suited to training-free metrics, and the number of parameters within a model is the most significant predictor of transformer performance. Our research shows that training-free NAS metrics are not universally successful across all architectures, and better transformer search spaces must be developed for training-free metrics to succeed. We hope that our work serves as a foundation for further research into training-free metrics for RNNs and transformers, in order to develop better and more efficient NAS techniques.

Figure 1 :
Figure 1: Plots of training-free metrics evaluated on 8,795 RNN architectures in NAS-Bench-NLP, against the test loss of the architectures on the Penn Tree Bank dataset when trained. Kendall τ and Spearman ρ are also shown. Only our Hidden Covariance metric, computed on the first and second layers of the RNN, showed a substantial correlation between the metric and trained test loss; some other metrics do have some positive correlation.

Figure 2 :
Figure 2: Plots of training-free metrics evaluated on 500 architectures randomly sampled from the FlexiBERT search space, against the GLUE score of the pretrained and finetuned architectures. All metrics are normalized against the number of features. Only our Attention Confidence metric displayed some positive correlation between the metric and final GLUE score.

Figure 3 :
Figure 3: Correlation between the number of parameters in a BERT-based architecture and its pretrained and finetuned GLUE score, for 500 architectures from the FlexiBERT search space. Number of parameters shows a strong correlation with architecture performance, substantially outperforming all training-free metrics evaluated.

Figure 4 :
Figure 4: Attention Confidence metric evaluated on architectures from the FlexiBERT search space, without normalization for number of features. The metric's performance substantially improves when not normalized, and its plot mirrors that of number of parameters, as indicated by its Kendall τ value.

Figure 5 :
Figure 5: Plot of number of parameters against test loss for 8,795 RNN architectures in NAS-Bench-NLP. Unlike the architectures in the FlexiBERT search space, there is no correlation between number of parameters and architecture performance for the architectures in NAS-Bench-NLP.

Table 1 :
The FlexiBERT search space, with hyperparameter values spanning those found in BERT-Tiny and BERT-Mini. The hidden dimension and number of encoder layers are fixed across the whole architecture; all other parameters are heterogeneous across encoder layers. The search space encompasses 10,621,440 architectures.