RankNAS: Efficient Neural Architecture Search by Pairwise Ranking

This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the accurate performance of architectures, even though the actual goal is only to distinguish "good" candidates from "bad" ones. Here we do not resort to performance predictors. Instead, we propose a performance ranking method (RankNAS) based on pairwise ranking, which enables efficient architecture search with far fewer training examples. Moreover, we develop an architecture selection method that prunes the search space and concentrates on the more promising candidates. Extensive experiments on machine translation and language modeling tasks show that RankNAS can design high-performance architectures while being orders of magnitude faster than state-of-the-art NAS systems.


Introduction
Neural Architecture Search (NAS) has advanced the state-of-the-art on various tasks, such as image classification (Pham et al., 2018; Real et al., 2019; Tan et al., 2019), machine translation (Fan et al., 2020; So et al., 2019), and language modeling (Pham et al., 2018; Liu et al., 2019). Despite the remarkable results, conventional NAS methods are computationally expensive, requiring training millions of architectures during search. For instance, obtaining a state-of-the-art machine translation model with an evolutionary algorithm requires more than 250 GPU years (So et al., 2019).
Several techniques have been proposed to improve search efficiency, such as sharing parameters among all architectures (Pham et al., 2018; Cai et al., 2018; Zhong et al., 2018), predicting performance instead of full training (Liu et al., 2018; Baker et al., 2018; Wen et al., 2020; Wei et al., 2020), and searching over a continuous space (Liu et al., 2019). Unfortunately, these approaches still suffer from the high cost of predicting the performance of each candidate architecture. An inherent reason is that obtaining accurate performance requires training numerous neural networks to convergence, as described in Sec. 2.2. However, predicting model performance as in previous NAS methods is unnecessary. Rather, all we need in NAS is to distinguish architectures of different quality, that is, to rank them. In this paper, we approach the problem by formulating NAS as a ranking task and propose RankNAS, a ranking model for comparing different architectures. One key challenge is that directly ranking all architectures in a large search space is still computationally infeasible. Therefore, we adopt the pairwise method (Burges et al., 2005; Wauthier et al., 2013), which reduces the ranking problem to a binary classification problem over architecture pairs. To speed up RankNAS further, we develop an architecture selection method that chooses the most promising architectures for evaluation according to the importance of features, e.g., the topology of architectures.
We test RankNAS on well-established machine translation and language modeling benchmarks. Experiments show that RankNAS is orders of magnitude faster than standard NAS systems and can find better architectures. Notably, RankNAS is generic to different tasks and evaluation metrics. It achieves competitive results on hardware-aware NAS tasks and is 10× faster than the HAT baseline. It also discovers new architectures that outperform the vanilla Transformer by +1.8 BLEU points on the IWSLT'14 De-En data and +1.5 BLEU points on the WMT'14 En-De data, surpassing the Evolved Transformer (So et al., 2019) with 150,000× less search cost.

Preliminaries
NAS generally consists of two steps: 1) sample architectures from the pre-defined search space, and 2) estimate the performance of these samples. This work focuses on the performance estimation step, which is the efficiency bottleneck of NAS.

Search Space
The search space contains all possible architectures for the search. In this work, we use the Transformer architecture for description, but the problem and solutions discussed are general and can be applied to other models. Following HAT, we represent a Transformer architecture as a set of features and search for the optimal model configuration.
An overview of the search space is shown in Figure 2. It extends HAT's space and is inspired by manually designed Transformer variants, including Relative Position Representations (Shaw et al., 2018) and the Deep Transformer. The search space can also be represented as a supernet where each sub-network is a unique architecture. The search space contains around 10^23 possible architectures, as detailed in Appendix A.1. Exploring such a large space with an exhaustive method is computationally prohibitive.

Performance Estimation
Let $\mathcal{A}$ denote the search space, where each architecture is represented by a feature vector $\alpha$. Formally, the goal of NAS is to find the optimal architecture $\alpha^{*}$ with the best performance, which can be measured by some metric such as accuracy or latency. The performance estimation process consists of two steps: 1) estimate the performance of all architectures, and 2) choose the architecture with the optimal performance.

Figure 2: The architecture search space. We search for the optimal model size, e.g., the number of layers, and network topology, e.g., connections between different layers. The encoder part is ignored in the language modeling task. Appendix A.1 gives more details about the design choices for different tasks.
Without loss of generality, we define $S(\cdot)$ as the performance evaluated by some metric. The task is to find the most promising architecture that maximizes $S(\cdot)$. Standard NAS methods solve this problem by learning to estimate the performance of each architecture. The objective is given by:

$$\alpha^{*} = \arg\max_{\alpha \in \mathcal{A}} S_{\mathrm{val}}(\alpha, w^{*}(\alpha)) \quad \mathrm{s.t.} \quad w^{*}(\alpha) = \arg\max_{w} S_{\mathrm{train}}(\alpha, w) \qquad (1)$$

where $w$ is the weights associated with the architecture, and $S_{\mathrm{val}}$ and $S_{\mathrm{train}}$ are the evaluation results on the validation set and training set, respectively. Optimizing Eq. 1 is time-consuming because obtaining the optimal weights for each architecture requires training it to convergence. Although we can share the weights among all architectures to amortize the cost, performance evaluation is still nontrivial and requires numerous training steps.

NAS as Ranking
As mentioned in Sec. 2.2, the goal of NAS is to find promising architectures that achieve high performance on unseen data. NAS requires distinguishing whether the architectures are "good" or "bad" rather than predicting accurate performance. Therefore, it is natural to treat NAS as a ranking problem, in which the explicit goal is to rank different architectures correctly.

Pairwise Ranking
Problem Formulation. Given an architecture $\alpha$, we define a score $s$ for it by a scoring function $r(\cdot)$:

$$s = r(\alpha; p) \qquad (2)$$

where $p$ is the parameters of the scoring function. We implement the scoring function with a gradient boosting decision tree, as detailed in Sec. 4.1. We want to optimize $p$ such that $s$ assigns high scores to good architectures and low scores to bad architectures, which induces a ranking of the candidate architectures in the search space. Directly sorting all candidate architectures in a large search space is infeasible. A solution is to reduce the listwise ranking problem to a pairwise ranking problem. Fortunately, the properties of the NAS task allow us to do so: as described in Dudziak et al. (2020), the relation between any pair of performances is antisymmetric, transitive, and connex. This makes it possible to rank all architectures via pairwise comparisons, substantially reducing the training complexity.
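Because the pairwise relation is antisymmetric, transitive, and connex, a total order over candidates can be recovered by ordinary sorting with a learned comparator. A minimal sketch of this reduction, where `score_fn` is a hypothetical stand-in for the trained scoring function $r(\alpha; p)$, not the paper's implementation:

```python
from functools import cmp_to_key

def rank_architectures(archs, score_fn):
    """Rank candidate architectures with a learned pairwise comparator.

    Antisymmetry, transitivity, and connexity of the performance relation
    guarantee that sorting by pairwise comparisons yields a total order.
    """
    def compare(a, b):
        # a ranks before b when its score is higher
        return -1 if score_fn(a) > score_fn(b) else 1
    return sorted(archs, key=cmp_to_key(compare))

# Toy example: "architectures" are (layers, dim) tuples, scored by a dummy rule.
archs = [(2, 512), (6, 1024), (4, 768)]
ranked = rank_architectures(archs, score_fn=lambda a: a[0] * a[1])
```

In practice the comparator would call the trained ranking model; the toy `score_fn` above only illustrates the sorting mechanics.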
Training Set Construction. In pairwise ranking, the learning task is framed as a binary classification of architecture pairs into two categories: correctly ordered and incorrectly ordered. Given an architecture pair $(\alpha_i, \alpha_j)$ and the order of performance $\bar{P}_{ij}$, we can construct training examples $(\alpha_i, \alpha_j, \bar{P}_{ij})$ for the classification by comparing the two performance values. Note that $\bar{P}_{ij}$ is a 0-1 variable. For example, if $\alpha_i$ is better than $\alpha_j$, we add $(\alpha_i, \alpha_j, 1)$ and $(\alpha_j, \alpha_i, 0)$ to the training set.
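The construction above can be sketched as follows; `perf` stands in for whatever intermediate performance measure is available, and both orderings of each pair are emitted with labels 1 and 0:

```python
def build_pairwise_examples(archs, perf):
    """Construct binary-classification examples from architecture pairs.

    perf[i] is the measured (intermediate) performance of archs[i];
    higher is better.  Each pair yields both orderings, labeled 1
    (correctly ordered) and 0 (incorrectly ordered).
    """
    examples = []
    for i in range(len(archs)):
        for j in range(i + 1, len(archs)):
            if perf[i] == perf[j]:
                continue  # skip ties: they carry no preference signal
            better = int(perf[i] > perf[j])
            examples.append((archs[i], archs[j], better))
            examples.append((archs[j], archs[i], 1 - better))
    return examples

pairs = build_pairwise_examples(["a", "b"], perf=[0.9, 0.7])
```

Skipping exact ties is an implementation choice made here for simplicity; the paper does not specify how ties are handled.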
Optimization. Consider a pair of architectures $(\alpha_i, \alpha_j)$, scored by $s_i$ and $s_j$, respectively. The probability of $\alpha_i$ being better than $\alpha_j$ is given by the score difference passed through an activation function $g$:

$$P_{ij} = g(s_i - s_j) \qquad (3)$$

Algorithm 1: Training of RankNAS
Input: search space $\mathcal{A}$ and ranking model $r$
1 while $r$ not converged do
2   training example construction: sample $(\alpha_i, \alpha_j)$ from $\mathcal{A}$, compute $\bar{P}_{ij}$ by comparing their performance;
3   classification: compute scores $(s_i, s_j)$;
4   optimization: optimize $r$ w.r.t. Eq. 6.
5 end
We assume that $P_{ij} \geq 0.5$ means $\alpha_i$ is better than $\alpha_j$, while $P_{ij} < 0.5$ means $\alpha_j$ is better than $\alpha_i$. Here we use a logistic function to achieve this goal:

$$P_{ij} = \frac{1}{1 + e^{-(s_i - s_j)}} \qquad (4)$$

Similarly, $P_{ji}$ can be induced by:

$$P_{ji} = 1 - P_{ij} = \frac{1}{1 + e^{-(s_j - s_i)}} \qquad (5)$$

Denote the gold label of $\alpha_i$ being better than $\alpha_j$ as $\bar{P}_{ij}$. We use the cross-entropy loss for the classification. The loss for a pair of inputs is:

$$\mathcal{L}_{ij} = -\bar{P}_{ij} \log P_{ij} - (1 - \bar{P}_{ij}) \log (1 - P_{ij}) \qquad (6)$$

Compared with Eq. 1, Eq. 6 only requires $\bar{P}_{ij}$. In particular, we use the intermediate performance measured on the validation set during training, which is much easier to obtain than the accurate performance of candidate architectures. In this sense, the ranking model is "easier" to learn and may not need as many training samples as performance prediction. RankNAS also enables efficient optimization through gradient methods. Algorithm 1 describes the complete training process of the ranking model.
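The logistic probability and the per-pair cross-entropy loss (Eq. 6) combine into a few lines of code. This is a sketch of the loss for a single pair only, not the full training loop:

```python
import math

def pairwise_loss(s_i, s_j, gold_ij):
    """Cross-entropy loss over one architecture pair (Eq. 6 sketch).

    s_i, s_j : scores from the ranking model
    gold_ij  : 1.0 if architecture i is known to be better, else 0.0
    """
    p_ij = 1.0 / (1.0 + math.exp(-(s_i - s_j)))   # logistic of the score gap
    return -(gold_ij * math.log(p_ij)
             + (1.0 - gold_ij) * math.log(1.0 - p_ij))

# A correctly ordered pair with a large score gap incurs a small loss;
# the same gap in the wrong direction incurs a large one.
small = pairwise_loss(3.0, 0.0, 1.0)
large = pairwise_loss(0.0, 3.0, 1.0)
```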

Applying Pairwise Ranking
Although the training time of the ranking model is heavily reduced, applying it to rank all architectures in the search space $\mathcal{A}$ remains challenging: exploring all architectures is computationally expensive, even when the task is a binary classification.
Feature Importance. Inspired by previous feature selection methods (Breiman, 2001; Fisher et al., 2019), we measure the importance of an architectural feature (e.g., the number of layers) by calculating the increase in model error after permuting the feature. We assume that each architecture $\alpha$ is represented by features $f \in \mathbb{R}^{M \times N}$, where $M$ is the number of different features and $N$ is the dimension of the feature vectors. We also assume a set $C$ that contains $n$ architectures sampled from the search space. We first estimate the original model error $L_{total}$ on $C$ by accumulating the prediction errors. For any feature $f_i \in f$, we randomize it for each architecture in $C$. The randomized architectural features are then passed to the ranking model and yield an error $L_i$. The importance of the $i$-th feature $f_i$ is defined by:

$$I(f_i) = \frac{L_i}{L_{total}} \qquad (7)$$

where a higher value implies that $f_i$ is more important.
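The permutation-importance procedure above can be sketched as follows; `rank_error` is a hypothetical stand-in for the ranking model's accumulated prediction error on the sample set, and the ratio-style importance is one plausible reading of the definition:

```python
import random

def feature_importance(rank_error, archs, num_features):
    """Permutation importance of each architectural feature (sketch).

    rank_error(archs) returns the accumulated prediction error L of the
    ranking model on a list of architectures, each a list of feature
    values.  The importance of feature i is the error ratio L_i / L_total
    after shuffling feature i across the sample.
    """
    l_total = rank_error(archs)
    importances = []
    for i in range(num_features):
        shuffled_col = [a[i] for a in archs]
        random.shuffle(shuffled_col)          # break the feature's signal
        permuted = [a[:i] + [v] + a[i + 1:]
                    for a, v in zip(archs, shuffled_col)]
        importances.append(rank_error(permuted) / l_total)
    return importances
```

For a feature the model truly relies on, shuffling it inflates the error and drives the ratio above 1; for an irrelevant feature the ratio stays near 1.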
Search Space Pruning. With the above measure, it is easy to select the valuable architectural features. Given all features $f \in \mathbb{R}^{M \times N}$, we discard those with a score less than a threshold $\theta$ and obtain the selected features $f' \in \mathbb{R}^{M' \times N}$, where $M' < M$. Then we prune the search space according to the selected features. For instance, if the feature Embedding Dimension is not selected, we keep it fixed during the search. Finally, we only search over the architectures in the reduced search space. An overview of the search process is presented in Figure 3. As described in Sec. 3.1, training the proposed ranking model is much cheaper than previous methods, which need to optimize the parameters of all architectures. Pruning the search space further reduces the number of architectures to be evaluated. Also, the sampling procedure can be implemented with any existing NAS search strategy, e.g., Random Search (RS) or an Evolutionary Algorithm (EA).
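A sketch of the threshold-based selection; the feature names and importance values here are illustrative, not the paper's measured ones:

```python
def prune_search_space(feature_names, importances, theta):
    """Keep only features whose permutation importance reaches theta.

    Discarded features are frozen at a fixed value during the search,
    shrinking the space the sampler has to explore.
    """
    return [name for name, imp in zip(feature_names, importances)
            if imp >= theta]

kept = prune_search_space(
    ["embed_dim", "ffn_dim", "num_layers"],  # hypothetical features
    [1.02, 1.41, 1.30],                      # hypothetical importances
    theta=1.25,
)
```

With a ratio-style importance, a threshold like 1.25 reads as "keep features whose permutation inflates the error by at least 25%".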

Experimental Setups
We evaluate our methods on language modeling and machine translation tasks. In the experiments, we search for hardware-aware architectures and high-accuracy architectures.
Training Setups. For machine translation, we experiment on the IWSLT'14 De-En and WMT'14 En-De tasks, using the same settings as previous work. For language modeling, we experiment on the WikiText-103 dataset (Merity et al., 2017), likewise with standard settings. We set the maximum number of tokens per sample to 1,843 to fit the memory constraints and apply gradient accumulation to keep the same batch size as Baevski and Auli (2019). All models are trained with mixed precision on 8 NVIDIA RTX 2080 Ti GPUs, except for the IWSLT models, which take only one GPU for training.
Ranking Model Setups. We implement the ranking model (binary classifier) described in Sec. 3.1 with LightGBM (Ke et al., 2017) and set the learning rate to 0.1. To prevent overfitting, we set the maximum number of leaves to 30 and the maximum tree depth to 6. We also use the default regularization terms and apply early stopping during training: training stops if the validation score does not increase for 5 rounds. After training the ranking model, we apply the search space pruning method to find the most valuable architectural features for different tasks and hardware. Pruning has two hyper-parameters: the sample size and the threshold. We set them to 200/1.15 for the hardware-aware architecture search and 300/1.25 for the high-accuracy architecture search.
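The stated hyper-parameters can be written down as a LightGBM configuration. This is a sketch only; the actual training call is left as a comment since it requires LightGBM to be installed, and the exact parameter set used in the paper may differ:

```python
# Hyper-parameters from the text, as a LightGBM binary-classifier config.
ranker_params = {
    "objective": "binary",   # pairwise examples are labeled 0/1
    "learning_rate": 0.1,
    "num_leaves": 30,        # cap leaves to limit overfitting
    "max_depth": 6,          # cap tree depth likewise
}
early_stopping_rounds = 5    # stop if the validation score stalls

# With LightGBM installed, training would look roughly like:
#   import lightgbm as lgb
#   model = lgb.train(
#       ranker_params,
#       lgb.Dataset(X_train, y_train),
#       valid_sets=[lgb.Dataset(X_val, y_val)],
#       callbacks=[lgb.early_stopping(early_stopping_rounds)],
#   )
```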
Architecture Search Setups.

Results
Hardware-Aware Architecture Search. Hardware-aware NAS aims to maximize accuracy under specified latency constraints on different hardware platforms. To achieve this goal, we first rank architectures by their latencies and pick those that meet the constraint. Then we rank the selected architectures by their losses on the validation set and choose the best one. For the machine translation tasks, we use the same search space as HAT, which contains around 10^15 possible architectures. For the language modeling task, we use the following search space: [10, 12, 14] for the number of decoder layers, [768, 1024] for the embedding dimension, [3072, 4096, 5120] for the hidden dimension, and [8, 12, 16] for the number of heads in the attention modules. We add a simple linear projection without bias if two adjacent layers have different hidden sizes. Table 1 compares RankNAS with HAT and the Transformer (Vaswani et al., 2017) on the machine translation tasks. Our method is effective in reducing the search cost for different tasks and hardware platforms. For instance, it requires 10.53× less cost to find a comparable architecture on the WMT task. The discovered architectures also have the lowest latencies with the same or better BLEU scores on most tasks. For example, the architecture designed for the CPU is 2.68× faster than the standard Transformer.
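The two-stage selection described above (filter by latency, then pick by validation loss) can be sketched as follows, with toy predictor callables standing in for the trained ranking models:

```python
def hardware_aware_select(archs, latency_of, loss_of, latency_budget):
    """Two-stage hardware-aware selection (sketch).

    Keep candidates whose predicted latency meets the budget, then return
    the one with the lowest predicted validation loss.  latency_of and
    loss_of are placeholders for the trained ranking models.
    """
    feasible = [a for a in archs if latency_of(a) <= latency_budget]
    if not feasible:
        return None
    return min(feasible, key=loss_of)

best = hardware_aware_select(
    archs=[1, 2, 3],                  # toy "architectures"
    latency_of=lambda a: 50 * a,      # toy latency model (ms)
    loss_of=lambda a: 1.0 / a,        # toy quality model
    latency_budget=120,
)
```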
We present the architecture search results for language modeling on the WikiText-103 test data in Table 2. All models are evaluated with a context window of 2,560 tokens, following prior work. Our method significantly accelerates the baseline on different devices: 2.59× on the CPU and 1.83× on the GPU. Our model also obtains a perplexity of 18.13, which outperforms Transformer-XL and is comparable to state-of-the-art language models, e.g., the Sandwich Transformer (Press et al., 2020).
High-Accuracy Architecture Search. Unlike hardware-aware architecture search, high-accuracy architecture search optimizes accuracy only and does not consider latency. In the experiments, we enlarge HAT's search space by introducing two additional features, Relative Attention Position (Shaw et al., 2018) and Layer Norm Position, as shown in Table 5 and Table 6; the enlarged space is orders of magnitude larger than HAT's. We compare RankNAS with state-of-the-art machine translation models designed by human experts and with models discovered by other NAS methods. The results are presented in Table 3. RankNAS consistently outperforms the other methods on both the IWSLT and WMT tasks, demonstrating that RankNAS can also design high-accuracy architectures. Notably, the discovered architectures achieve a +1.8 BLEU improvement on the IWSLT task and a +1.5 BLEU improvement on the WMT task over the standard Transformer baseline (Vaswani et al., 2017). RankNAS surpasses the Evolved Transformer (So et al., 2019) with orders of magnitude less search cost. RankNAS also matches the performance of gradient-based methods, including NAO (Fan et al., 2020) and DARTSformer (Zhao et al., 2021).

Analysis
We analyze both the accuracy and efficiency of our search method and study the effect of different features on model performance.

Architecture Ranking Accuracy
To study the accuracy of the proposed method, we evaluate it on the IWSLT translation task. In the experiment, we randomly sample 200 different architectures from the HAT search space (small) and the enlarged search space (large) introduced in Sec. 4.2. We train these architectures from scratch and measure their BLEU scores on the test set. Table 4 presents the Kendall and Spearman rank correlation coefficients between the predicted results and the real scores. RankNAS outperforms HAT in terms of both ranking correlations. For example, RankNAS achieves a high Kendall's Tau of 0.883 and 0.826 on the small and large spaces, respectively, indicating that the predicted ranking is very close to the real one.
Importance of Ranking Accuracy. Although our ranking model is more accurate than prior methods, a question remains: how does ranking accuracy affect search quality? We analyze the impact of different ranking models on the high-accuracy NAS task. Figure 6 compares two ranking models with different ranking correlation coefficients. The results are obtained from best-so-far models trained from scratch on the IWSLT'14 De-En data. They show that inaccurate ranking leads to poor search results, indicating that an accurate ranking model is essential for architecture search.

Analysis of Discovered Architectures
We present the discovered architectures in Appendix A.2 and analyze the important features for different hardware on the IWSLT'14 De-En task. Figure 5 (top) plots the selected features for the CPU. It shows that the decoder FFN dimension is the most important feature for predicting latency, followed by the decoder's arbitrary attention and the encoder FFN dimension. We also find that the decoder embedding dimension has a similar impact on latency as the number of decoder layers. Figure 5 (bottom) illustrates the results for the GPU. As on the CPU, latency on the GPU correlates strongly with the decoder attention module. The main difference is that latency on the GPU is insensitive to the FFN or embedding dimensions but more sensitive to the number of decoder layers.
The results indicate that we can design "shallow and wide" models for GPUs and "deep and thin" models for CPUs to achieve the Pareto-optimal state. Similar design insights have been verified in prior work.

Search Efficiency
Experiments in Sec. 4 show that our method has a much lower search cost than previous works. We now analyze how our method accelerates the architecture search.
Ranking Model Training Efficiency. The overall search cost includes the training time of the ranking model and the cost of the search process. Figure 1 compares our method and HAT on the IWSLT'14 De-En task. The two methods share the same search space and sampling strategy. We observe that training the ranking model takes most of the time. RankNAS speeds up ranking model training by 10.34 times compared with HAT. Pruning the search space further reduces the search time by 75%. Thus the overall search cost is significantly reduced, indicating that efficient training of the ranking model is essential for accelerating the search.
Architecture Search Efficiency. We also analyze the efficiency of our proposed methods on the IWSLT hardware-aware task. Figure 7 shows the validation loss curves of the models found by our method with different sampling strategies. We observe that RankNAS is compatible with different strategies. Also, the evolutionary algorithm outperforms random search in terms of both the rate of convergence and the search result.

Related Work
A common approach to accelerating the search process is to use a proxy, such as a reduced model size, training data, or number of training steps. However, proxies are inaccurate estimates of a model's performance and diminish NAS quality (Baker et al., 2018; Dudziak et al., 2020). Another popular approach is to share parameters among all architectures to reduce training time (Tan et al., 2019; Cai et al., 2019). However, it is infeasible to train all architecture candidates fairly to obtain their accurate performance. Recent works explored performance prediction based on architectural properties, i.e., the network topology and the model size (Liu et al., 2018; Long et al., 2019; Wen et al., 2020; Ning et al., 2020). For instance, the Hardware-Aware Transformer (HAT) encodes architectures as feature vectors and predicts latency with a Multilayer Perceptron (MLP) for the target hardware. BRP-NAS (Dudziak et al., 2020) proposed an end-to-end performance predictor based on a Graph Convolutional Network (GCN). Although these methods greatly improve performance estimation efficiency, they still require many samples and train numerous neural networks to convergence, thereby increasing the search cost. Instead, we are motivated by the fact that NAS only needs to distinguish different candidate architectures; thus, NAS can be solved by learning a pairwise ranking rather than obtaining the accurate performance of architectures.

Conclusion
We have presented RankNAS, a simple yet efficient NAS algorithm for both hardware-aware and high-accuracy architecture search. We have shown that pairwise ranking can significantly improve search efficiency. We have also proposed a search space pruning method that makes the ranking model more efficient during the search. Our approach outperforms prior methods in both efficiency and accuracy. RankNAS requires 80% less time for ranking model training on the hardware-aware search task and accelerates the overall search process by 11.53 times. Also, the architectures discovered by our method outperform state-of-the-art Transformer models in terms of efficiency and accuracy.

A.2 Visualization of Good Architectures
Figure 8 illustrates one of the discovered Transformer architectures. The presented architecture achieves 36.2 BLEU on the IWSLT'14 De-En translation task with a latency of 77 ms on a GTX 1080 Ti GPU, outperforming the vanilla Transformer by +1.8 BLEU at 2.6× the speed.