ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs

The training and inference efficiency of ever-larger deep neural networks highly relies on the performance of tensor operators on specific hardware platforms. Therefore, a compilation-based optimization flow with automatic tensor generation and parameter tuning is necessary for efficient model deployment. While compilation-based methods with performance models can provide dynamic and suitable code optimization, they suffer from a large design space, rough measurement accuracy, and poor transferability among different hardware platforms. This paper presents ATFormer, a simple yet efficient design with attention-inspired modules that accurately predicts the performance of optimized operators by capturing global and long-range dependencies within a complete scheduling space. Compared with state-of-the-art methods, ATFormer can predict the optimal implementation of tensor operators to reduce inference time with minimal effort on modern DNN benchmarks. Furthermore, ATFormer with pre-trained parameters can quickly adapt to different workloads and hardware via transfer learning.


Introduction
Recently, there has been significant improvement in model performance for deep neural networks (DNNs) (He et al., 2016; Sandler et al., 2018; Shan et al., 2021; Devlin et al., 2019; Wu et al., 2019; Biten et al., 2019; Bello et al., 2019). However, this progress has been accompanied by a significant increase in the number of operators and, consequently, in the computational complexity of DNNs. As a result, it has become increasingly challenging to efficiently deploy DNNs with optimized tensor programs on hardware accelerators such as CPUs, GPUs, and TPUs (Jouppi et al., 2017).
To overcome these limitations, mainstream search-based tensor compilers (Chen et al., 2018a; Zheng et al., 2020; Bai et al., 2021; Li et al., 2020; Fegade et al., 2021) have been developed. These compilers automatically search for the optimal deployment configuration of each operator on increasingly heterogeneous platforms. Conducting on-device measurements is extremely time-consuming, making it impossible to place all the generated tensor programs on the target platform for measurement during compilation. Therefore, prediction via a good cost model is crucial for reducing time-consuming measurements during compilation, which can significantly improve search efficiency and quality.
Nevertheless, existing cost models are capable of selecting nearly optimal configurations but suffer from excessively long optimization times. These long optimization times not only impede the deployment period but also raise concerns about the practicality of search-based compilers. Furthermore, statistical cost models trained on one hardware platform exhibit significant performance degradation on different hardware, making them unusable across platforms. Notably, the execution times of tensor programs can vary significantly across platforms due to domain gaps, making it challenging to deploy optimized models on multiple platforms. This is further compounded by the significant differences in the features extracted from the various platforms. Even when extracted on GPUs, a feature's stability and performance cannot be guaranteed across different GPU architectures such as Volta, Turing, and Ampere. Therefore, additional engineering effort is necessary to account for the differences in hardware architectures, resulting in a laborious and cumbersome feature extraction process.
To address these challenges, we propose a powerful yet simple approach that uses attention-inspired blocks to enhance the performance of cost models. These blocks can capture global and long-range dependencies among tensor program statements. Additionally, transferable features with pre-trained parameters are used to expedite search convergence across different hardware platforms. These techniques can be easily incorporated into existing search algorithms and improve efficiency in an end-to-end fashion. Our design, ATFormer, consistently outperforms baselines on popular DNN benchmarks, including small- and large-scale models. Furthermore, our techniques enable cross-platform transfer learning, resulting in more efficient deployment.
The main contributions of this paper are as follows: (i) We highlight the limitations of current auto-tuning frameworks: existing tree-based performance models are insufficient for evaluation in a large search space, and transferable knowledge is difficult to acquire across different platforms. (ii) We present a simple yet efficient design that utilizes attention-based blocks to explore the correlation between all innermost non-loop statements in a full tensor program, resulting in accurate prediction. (iii) Our approach enables rapid adaptation of performance tuning across various GPU platforms using parameters pre-trained on static datasets, not only in cross-operator but also in cross-platform scenarios. Comprehensive experiments on modern DNN benchmarks and the large-scale TenSet (Zheng et al., 2021) demonstrate the consistent and superior performance of our method.

Background and Related Work
Deep Learning Compiler. Recently, compiler-based optimization frameworks such as Halide (Adams et al., 2019), TVM (Chen et al., 2018b), XLA (Sabne, 2020), and TACO (Kjolstad et al., 2017) have progressed rapidly. These optimization schemes typically consist of two parts, DL framework frontends and code generation backends, as illustrated in Figure 1. The frontend converts an input model into a high-level graph-based intermediate representation (IR) and applies target-independent optimizations such as operator fusion and data layout transformation. In the backend, target-dependent optimization passes, along with hardware features, further optimize the final performance. TVM (Chen et al., 2018a) is a state-of-the-art search-based tensor compiler that is widely used in academia and industry. Its auto-tuning aims to achieve performance comparable to hand-tailored libraries and has achieved promising results. TVM has two auto-tuning versions: AutoTVM (Chen et al., 2018c) and Ansor (Zheng et al., 2020). While AutoTVM is a semi-automated framework that requires pre-defined manual templates, Ansor is more advanced and fully automated. However, both frameworks need to collect data on the fly during the search, resulting in extremely long compilation times.
Tree-based Performance Model. Decision trees are frequently used in classification and regression problems. To enhance their performance, an ensemble learning approach is typically employed to reduce variance. XGBoost (Chen and Guestrin, 2016a) and LightGBM are powerful feature-based models for sequence modeling tasks. To achieve accurate prediction, a number of works (Chen et al., 2018c; Zheng et al., 2020; Ahn et al., 2020; Gao et al., 2021; Bai et al., 2021, 2023; Huang et al., 2023; Zhao et al., 2023) use XGBoost as the performance model during tuning. AutoTVM extracts domain-specific features from a provided low-level abstract syntax tree (AST); during optimization, these features, which include loop structure information and generic annotations, are explored. Moreover, TreeGRU (Tai et al., 2015) recursively encodes a low-level AST into an embedding vector, which is mapped to a final predicted score by a fully connected layer to enhance performance. Halide (Adams et al., 2019) builds regression models with hardware-specific features for auto-scheduling. TabNet (Arık and Pfister, 2020) uses sequential attention to select the most salient features for reasoning at each decision via a deep tabular architecture.
DNN-based Performance Model. In contrast, some recent approaches aim to reduce the impact of search algorithms on final performance by utilizing more robust and powerful cost models. (Kaufman et al., 2020) and (Sun et al., 2022) employ graph neural networks to predict the latency of DNNs on TPUs. (Steiner et al., 2021) formulates the tuning process as a deterministic Markov decision process (Xiang et al., 2015) and solves it by learning an approximation of the value function. Tiramisu (Baghdadi et al., 2019) manually extracts 2,534 features from the structure of the AST and forwards the AST as a computation stream to propagate features during training. These models are trained effectively on datasets of only a few thousand schedules using hardware-dependent features crafted through heavy feature engineering. However, such complex feature engineering can become problematic. As hardware-specific features are difficult to transfer to a new platform, a learned performance model trained on one hardware platform typically performs poorly on another, an issue we call cross-hardware unavailability. Additionally, this approach cannot keep pace with the rapid development of new hardware, which further exacerbates the problem.

Problem Formulation
We describe a DNN model as a computation graph and then define some important terminology.
Definition 1 (Subgraph). The computation graph G is partitioned into a set of subgraphs S by the graph-level optimizer (Roesch et al., 2018).
Each search task is extracted from an independent subgraph S_i on a specific hardware platform H. Thus, we define the set of search tasks Q as

Q = {Q_1, Q_2, ..., Q_n}, (1)

where n is the number of subgraphs in G. Note that each subgraph S_i contains a computation-intensive operator σ with σ ∈ S_i. Therefore, we use Q_i to represent the i-th search task in G. Each subgraph S_i has its own search space, which is determined by the input and output shapes, data precisions, memory layout, and the hardware platform. The search space is usually large enough to cover almost all kinds of tensor candidates, and its contents depend on the hardware platform. Each tensor program can be considered a candidate in the search space. We define the hierarchical search space φ_{1,2}, which decouples high-level structures φ_1 from low-level details φ_2, allowing for the efficient exploration of potential tensor candidates during the tuning process.
Here, we can transform a tuning problem into an optimization problem that explores the potential tensor programs in a hierarchical search space.
Problem 1. Given a code generation function ð, high-level structure generation parameters φ_1, low-level detail sampling parameters φ_2, a computation-intensive operator σ, and an operator setting k (e.g., kernel size), our goal is to use φ_{1,2} to build a hierarchical search space and generate a tensor program p that achieves the optimal prediction score y* on a specific hardware platform H:

y* = max_{p = ð(φ_{1,2}, σ, k)} f(p). (2)

The cost model f predicts the score y of the tensor program p. The accuracy of the cost model f is crucial for finding the ideal optimization configuration.
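Concretely, the objective above amounts to scoring every candidate with the cost model f and keeping the argmax. A minimal pure-Python sketch under toy assumptions (generate_candidates, select_best, and the scoring lambda are illustrative stand-ins, not part of ATFormer's actual API):

```python
# Hypothetical sketch of Problem 1: pick the tensor program p* that maximizes
# the cost model's predicted score over a hierarchical search space.

def generate_candidates(phi1_sketches, phi2_samples):
    """Cross high-level sketches with low-level samples (stand-in for codegen)."""
    return [(sketch, annot) for sketch in phi1_sketches for annot in phi2_samples]

def select_best(candidates, cost_model):
    """Return the candidate p* with the highest predicted score y*."""
    return max(candidates, key=cost_model)

# Toy stand-in cost model: prefer larger tiles, penalize deeper annotations.
score = lambda p: p[0] * 2 - p[1]

cands = generate_candidates([8, 16, 32], [0, 1, 2])
best = select_best(cands, score)
# best is the candidate with the largest sketch and the cheapest annotation
```

In the real system, the candidate set is far too large to enumerate exhaustively, which is why the evolutionary search described in Section 4 explores it instead.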

Performance Model
The process of optimization using our design is outlined in Algorithm 1. The input is a set of to-be-optimized operators or subgraphs with different configurations. To implement our workflow, three functions are defined: GenerateHighSketch(), Sampling(), and EvolutionSearch(), as shown in Algorithm 1. GenerateHighSketch() takes φ_1, σ, and k as input and returns the high-level generation sketch GS_1. Sampling() takes GS_1, φ_2, σ, and k as input and returns the low-level annotation samples GS_2. EvolutionSearch() takes GS_1 and GS_2 as input and returns a group of tensor candidates for cost-model training. Next, an evolutionary search strategy is used along with a learned cost model to fine-tune the performance of the generated tensor programs. By iteratively mutating high-quality tensor programs, it can generate new programs of potentially higher quality. After a number of measurement trials, the best tensor program configurations can be identified.

Hierarchical Feature Generation. The input of ATFormer is a series of mixed-granularity feature vectors extracted from p_σ, where p_σ is the full tensor program implementing operator σ. Each vector represents a single computation statement within p_σ. These feature vectors are composed of two components: (i) coarse-grained operator embedding features that capture the high-level structure of the operator σ, and (ii) fine-grained statement features that capture the low-level details of each statement within program p_σ. Each operator in the subgraph S can be classified into a few categories, and we represent each operator with a one-hot embedding feature vector covering all possible operator types. In practice, we use feature vectors of length 10 for the operator embedding and length 164 for the statement features, consistent with the approach used in Ansor (Zheng et al., 2020). The prediction score for a subgraph is computed as the sum of the prediction scores for each innermost non-loop statement within the loop nests of the full tensor program. More details can be found in Figure 2.

Model Architecture. Our proposed ATFormer model consists of three layers: (i) a kernel embedding layer, which extracts a compact feature representation; (ii) a computation processing layer, which captures essential information from the innermost non-loop computation statements in the neighborhood; and (iii) a simple regression layer for making the final prediction. ATFormer can be easily integrated into existing search algorithms and consistently improves the efficiency of auto-tuning. We believe that the simplicity of our method will attract more research attention to the field of tensor operator optimization, further enhancing training and inference efficiency. The feature processing of the computation and regression layers in ATFormer is illustrated in Figure 3. The kernel embedding layer is composed of two fully connected layers with ReLU activation; its function is to project the features from a low-dimensional space into a new embedding space for similarity measurement. Starting from the batched tensor programs I ∈ R^{L×D_in} representing a specific type of operator σ, where L is the accumulated number of feature statements within I, the kernel embedding layer generates a set of feature statements E ∈ R^{L×D_out} in the embedding space. Typically, we use D_out = 512. The value L is determined by the parameters of the high-level structures φ_1 and the low-level detail sampling φ_2 for each subgraph S.
As for the computation layer, the set of feature statements E ∈ R^{L×D_out} is first split into M stacks of feature statements Z ∈ R^{M×N×D_out}. Each stack contains the N feature statements of innermost non-loop computation within a full tensor program p. We adopt the self-attention mechanism to aggregate the feature statements. With the parameter tensors written as W_Q, W_K, and W_V, a full tensor program with a set of innermost non-loop feature statements Z is first encoded into query Q, key K, and value V by three identical linear transformations, Q, K, V = Z W. These are then processed by the self-attention layer as:

Attention(Q, K, V) = softmax(QK^T / √d) V, (3)

where d = D_out is the embedding dimension. The final prediction for these M tensor programs is computed by a regression layer mapping dimension 512 to 1; the predicted score is y ∈ R^{M×1}.

Loss Function. The model ranks the performance of potential candidates in a large search space. Therefore, it can be trained with ranking losses or regression losses to predict relative or absolute scores. A common choice is the squared error function, which trains a regressor that mostly cares about identifying the well-performing tensor programs. The loss function of the model f on a full tensor program p with throughput h is MSELoss(f, p, h) = (Σ_{s∈S(p)} f(s) − h)^2, where S(p) is the set of innermost non-loop computation statements in tensor program p. We train ATFormer as the performance model f. However, during compilation we only care about the relative order of tensor program runtimes rather than their absolute values. We therefore instead use the following RankLoss (Cao et al., 2007) to rank the performance of candidates in the large design space, which fully exploits the optimal candidates and reduces the impact of the search algorithm on the final prediction results:

RankLoss(f, P) = Σ_{i,j: h_i > h_j} log(1 + exp(f(p_j) − f(p_i))), (5)

where the sum ranges over pairs of programs p_i, p_j in the candidate set P with measured throughputs h_i > h_j. We use the prediction f(x) to select the top-performing implementations of a full tensor program p. The model is trained on tensor programs extracted from all subgraphs of the computation graph G. The throughput of all tensor programs is normalized to the range [0, 1].
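The scaled dot-product attention of Equation (3) can be sketched in pure Python with toy dimensions. The helpers below (softmax, matmul, self_attention) are illustrative re-implementations, not the actual ATFormer code, which operates on batched PyTorch tensors:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V over N statement embeddings of dimension d."""
    d = len(Q[0])
    KT = [list(col) for col in zip(*K)]          # K transposed
    scores = matmul(Q, KT)                       # N x N attention scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)                    # weighted sum of values

# Two identical statement embeddings (d = 2) yield uniform attention weights,
# so each output row is the average of the value rows.
Z = [[1.0, 0.0], [1.0, 0.0]]
out = self_attention(Z, Z, Z)
```

With identical rows, every attention weight is 1/N, which makes the aggregation easy to verify by hand; in the real model, Q, K, and V come from learned linear projections of Z.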

Transfer Learning
The trade-off between search time and performance improvement is interesting to explore and exploit, as long search times may not always be acceptable.
Our focus so far has been a cost model for optimizing tensor operators on a specific hardware platform. In practical settings, however, we require a cost model that can be used across various hardware platforms. This would allow us to reuse a single cost model for multiple platforms by providing it with new online data during auto-tuning. To achieve this, we pre-train the cost model on an offline static dataset and exploit transferable features that are invariant across source and target domains to speed up the optimization process, as depicted in Figure 4. The use of transferable features contributes greatly to the success of transfer learning, as different designs may have varying degrees of invariance. By training the cost model offline on a dataset, we can significantly reduce the frequency of on-device measurements and use the pre-trained parameters as a starting point for new search tasks via transfer learning.
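The warm-start effect of pre-trained parameters can be illustrated with a deliberately tiny stand-in model. Everything below (the scalar "model", train_steps, and the synthetic offline/online data) is hypothetical and only meant to show why fine-tuning from pre-trained parameters converges faster than training from scratch:

```python
def train_steps(w, data, lr=0.1, steps=50):
    """Fit a scalar weight w to pairs (x, y) with y ≈ w * x by gradient descent."""
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * g
    return w

# Source platform: plenty of offline records, true slope 3.0.
offline = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
# Target platform: only a few online measurements, slightly shifted slope 3.2.
online = [(x, 3.2 * x) for x in (1.0, 2.0)]

w_pre = train_steps(0.0, offline)              # pre-train on the static dataset
w_cold = train_steps(0.0, online, steps=2)     # cold start: 2 online trials only
w_warm = train_steps(w_pre, online, steps=2)   # warm start from pre-trained w

# After the same two online steps, the warm start lands much closer to the
# target slope (3.2) than the cold start does.
```

The analogy to ATFormer is that the offline TenSet records play the role of the source data and the few on-device measurements during tuning play the role of the online data.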

End-to-End Execution Evaluations
Workloads. We evaluate the performance of ATFormer on various DNNs, including small- and large-scale models. For small-scale models, we use AlexNet, VGG-16, MobileNet-V2, ResNet-18/50, and Bert-Tiny. For large-scale models, we use BERT and GPT models, specifically BERT_base, BERT_large, GPT-2_large, and GPT-3_350M. We report the end-to-end inference latency with batch size 1 on an RTX 2080Ti.

Baselines and Settings. For the statistical model, we use XGBoost as a baseline, which has proven to be a state-of-the-art feature-based model in auto-tuning frameworks (Zheng et al., 2020).

Transfer Learning Evaluations
As mentioned in Section 3.3, we use RTX 2080Ti and 3090 GPUs as different platforms to verify our design with two typical metrics: (i) fix the number of measurement trials and compare the total latency, and (ii) fix a converged latency and compare the search time needed to reach it. To explore transferable features and fast adaptation of auto-tuning between different hardware platforms, ATFormer is pre-trained with a number of samples from TenSet and then fine-tuned on online datasets on the different platforms. We therefore divide our experiment settings into "traditional learning" and "transfer learning" parts.
Traditional Learning. On the 3090 GPU, the results show that self-attention based models achieve the best final performance compared with the tree-based and DNN-based cost models across different types of GPUs.

Transfer Learning. In Table 1, experiment results on the RTX 2080Ti and 3090 show that the pre-trained parameters make the search converge much faster. As the number of training tasks in the offline dataset grows from 50 to 500, the learning ability of cost models with self-attention blocks, including MHA, ATFormer-1L, and ATFormer-Mask, becomes more stable, and they can adapt to new tasks via transfer learning. The ATFormer-series models outperform the statistical model XGBoost and the DNN-based LSTM in optimized total latency with parameters trained from TenSet-100 to TenSet-500. All large-scale models are exported from Hugging Face, with a batch size of 1 and a maximum input sequence length of 512. As shown in Table 2, ATFormer achieves latency speedups of 1.39×, 1.11×, 1.10×, and 1.16× on the 3090 GPU compared to the PyTorch runtime. In terms of end-to-end tuning time, ATFormer achieves speedups of 4.97×, 5.10×, 5.69×, and 6.08× compared to traditional learning.
The performance of our efficient transfer learning on the NVIDIA RTX 3090 GPU can be found in Figure 7. On the TenSet-50 dataset, the curves start from different points, and XGBoost performs best. This means that the transferable features in the ATFormer-series models are not fully exploited when training on the limited dataset (50 tasks). The adaptation ability grows rapidly with the number of tasks in the offline dataset: from TenSet-100 to TenSet-500, the ATFormer-series models show fast adaptation and generalization across hardware platforms and operators compared with the XGBoost and LSTM models.
In Table 3, we compare traditional learning and transfer learning in terms of converged latency on both GPU platforms.

Ablation Study
Various designs are evaluated in this section. We report the total latency and search time on ResNet-18 and MobileNet-V2, and accuracy on the static datasets.

Convergence Speed. In Table 4, method (d) is the proposed ATFormer, which adapts the pre-trained parameters to the new task via transfer learning on top of method (c). Note that ATFormer with the pre-trained parameters minimizes both the total latency of all subgraphs in the three DNNs and the search time. The proposed ATFormer improves the total latency by a 4.66× speedup and the convergence speed by a 1.55× speedup. Method (f) is AutoTVM with the lambdaRank loss function; its performance is inferior to the baseline configuration.
Training Schemes. In Table 4, method (c) incorporates the mask module into method (b) during traditional learning. Method (d) imports the mask module into method (e) during transfer learning, resulting in a notable increase in convergence speed. It is worth noting that adding the mask scheme during traditional learning is not very helpful and can even degrade the total latency. For transfer learning with pre-trained parameters, however, incorporating the mask module is crucial for achieving faster convergence. The introduced techniques do not require expensive training resources in terms of either time or computation.

Model Architectures. Table 5 lists ATFormer with various architectures. To achieve high accuracy while minimizing the model parameters, we find that a self-attention block with four heads and 512 hidden dimensions performs best on total latency and search time. Note that ATFormer does not benefit from deeper encoder layers in the Transformer model. Thanks to its simple and efficient architecture, the inference latency of ATFormer is consistently lower than that of the DNNs it optimizes. We therefore use two encoder layers in the final design. All experiments are conducted on a workstation equipped with an Intel Core i9-12900K CPU, an NVIDIA GeForce RTX 3090 GPU, and a 2TB hard disk. Table 8 presents the training times (in seconds) of the ATFormer-series models on the static datasets. Our approach is also suitable for scenarios involving large batch sizes: Table 9 lists experimental results using batch size 8 on the NVIDIA 3090 GPU via traditional learning.

Conclusion
This paper introduces ATFormer, a novel and effective design for optimizing tensor programs. ATFormer employs hierarchical features with varying levels of granularity to model the end-to-end compilation. Moreover, self-attention blocks are utilized to explore global dependencies within a complete tensor program for high-quality evaluation. Through transfer learning, ATFormer achieves faster converged latency and superior transferability across different hardware platforms, outperforming previous state-of-the-art methods on standard benchmarks.
Limitations. We plan to extend transfer learning from GPUs to CPUs and to explore combining ATFormer with post-training quantization or pruning for efficient model deployment. Additionally, we will explore more universal and efficient methods for optimizing tensor programs with ATFormer, including leveraging hardware features to optimize performance on domain-specific accelerators such as NVIDIA's Tensor Cores.

A Appendix
A.1 Feature Extraction Details

The features used for ATFormer training are represented at two granularities: a coarse-grained and a fine-grained level. The coarse-grained feature describes each search task in the computation graph. It has 10 elements with a one-hot encoding pattern. In our code implementation, the coarse-grained vector covers these operators: "max", "min", "add", "Conv2dOutput", "Conv2d_winograd", "DepthwiseConv2d", "dense", "softmax", and "compute(b, i, j)". The "max" and "min" entries represent activation functions in deep learning, "dense" denotes the fully connected layer in the computation graph, and "compute(b, i, j)" is an important function used to implement each tensor operation.

If the intermediate representations of several operators are fused into the same "compute(b, i, j)" primitive, those operators are fused together and can run very efficiently on the specific hardware platform. As for the fine-grained vector, its length, including all the listed features for one statement, is 164. We use the same set of features for both the Turing 2080Ti and Ampere 3090 GPUs. The features can be summarized as follows:

• Number of float operations: The number of addition, subtraction, division, modulo, less-than, and greater-than operations, and intrinsic math functions such as exp and sqrt.
• Number of integer operations: Similar to the number of float operations, but for integer operations.
• Vectorization related features: The number of the innermost vectorized loop statements in a full tensor program.
• Unrolling related features: The number of the innermost unrolling loop statements in a full tensor program.
• Parallelization related feature: The number of the innermost parallelization loop statements in a full tensor program.
• Arithmetic intensity curve: We sample 10 points from a curve defined as FLOPs/Bytes, similar to the roofline model used in computer architecture. It helps to recognize whether a search task or operator in the computation graph is compute-intensive or memory-intensive on a specific hardware platform.
• Allocation related features: The size of the allocated buffer for the output results of each statement in a full tensor program.
By combining the coarse-grained and fine-grained feature vectors, we construct a hierarchical feature vector that takes full advantage of each statement in a full tensor program.
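A sketch of how such a hierarchical feature vector might be assembled. The operator vocabulary below is reconstructed from the list above, with a hypothetical "other" slot filling the tenth element; the real vocabulary and the 164-dimensional statement features are produced inside Ansor/TVM:

```python
# Illustrative operator vocabulary (9 listed operators + assumed "other" slot).
OPERATOR_TYPES = ["max", "min", "add", "Conv2dOutput", "Conv2d_winograd",
                  "DepthwiseConv2d", "dense", "softmax", "compute(b, i, j)",
                  "other"]

def one_hot(op, vocab=OPERATOR_TYPES):
    """Coarse-grained operator embedding: a length-10 one-hot vector."""
    v = [0.0] * len(vocab)
    v[vocab.index(op)] = 1.0
    return v

def hierarchical_feature(op, statement_features):
    """Concatenate the operator embedding (10) with the fine-grained
    statement features (164), giving one length-174 vector per statement."""
    assert len(statement_features) == 164
    return one_hot(op) + statement_features

# One statement of a "dense" task, with placeholder fine-grained features.
feat = hierarchical_feature("dense", [0.0] * 164)
```

Every innermost non-loop statement of a tensor program yields one such vector; the stack of vectors for a program is what the kernel embedding layer consumes.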

A.2 Implementation Details
ATFormer is implemented on top of Ansor and evaluated from two aspects: end-to-end search efficiency and quality, and performance portability. We compare ATFormer against state-of-the-art methods, including both statistical and DNN-based cost models. The items labeled XGBoost represent the Ansor default configuration. We also provide a detailed ablation study of the model architecture, accuracy, loss function, convergence speed, and training scheme, with insights and qualitative results. The generated tensor programs are evaluated on two different GPU architectures, Turing RTX 2080Ti and Ampere RTX 3090, with float32 data types used for all evaluations. We train the cost model using the Adam optimizer for 50 epochs, with a starting learning rate of 7e-4 that decays to 1e-6, and a training batch size of 512. We use TVM v0.8dev as in TenSet (Zheng et al., 2021), LLVM 11.0, and CUDA 11.0 for compilation, while XGBoost 1.5.0 and PyTorch 1.7.1 are used for training the models. The use of a "mask" is a widely adopted technique for training Transformers. In Figure 5, each tensor program is transformed into a sequence of vectors, with each vector representing a tensor computation statement. During training, all sequences have the same length: shorter sequences are padded with zeros at the end, and the padded items are masked out and excluded from the loss computation. We also experiment with several ablative models: MHA is a basic multi-head attention layer, ATFormer-1L has only one encoder layer, ATFormer has two encoder layers, and ATFormer-M uses the "mask" scheme during training.
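The padding-and-mask scheme can be sketched as follows. This is a pure-Python illustration (pad_batch and masked_sum are hypothetical helpers), whereas the actual implementation operates on batched PyTorch tensors:

```python
def pad_batch(sequences, feat_dim):
    """Pad every statement sequence to the batch maximum with zero vectors,
    and return a parallel 0/1 mask marking which positions are real."""
    max_len = max(len(s) for s in sequences)
    padded, mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        padded.append(seq + [[0.0] * feat_dim] * pad)
        mask.append([1.0] * len(seq) + [0.0] * pad)
    return padded, mask

def masked_sum(scores, mask):
    """Per-program score: sum per-statement scores, excluding padded rows.
    This mirrors summing predictions over innermost non-loop statements."""
    return [sum(s * m for s, m in zip(row, mrow))
            for row, mrow in zip(scores, mask)]

# One program with 1 statement and one with 3, feature dimension 2.
seqs = [[[1.0, 2.0]],
        [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]]
padded, mask = pad_batch(seqs, feat_dim=2)
# With a per-statement score of 1.0 everywhere, only unmasked rows count.
sums = masked_sum([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], mask)
```

Excluding the padded rows from the loss is what keeps programs with different statement counts comparable within one batch.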

A.3 Dataset Details
We evaluated our design using TenSet, a large-scale and challenging dataset for search-based tensor compilers.TenSet comprises 52 million performance records of tensor programs obtained from real measurements on different hardware platforms.
Various randomly generated tensor programs for popular workloads are compiled via the TVM compiler and executed on the target hardware platforms.
To ensure the inclusion of diverse workloads essential for generalization ability, we collected tensor programs from 120 networks with 13,848 tasks on the NVIDIA Tesla T4 GPU.This dataset serves as a series of static offline datasets.

A.4 Benchmark Details
We evaluate the performance of the programs generated by ATFormer on two levels: end-to-end network evaluations and performance portability via transfer learning. For each level of evaluation, we compare ATFormer against state-of-the-art methods, including the statistical models:
• XGBoost (Chen and Guestrin, 2016b)
• LightGBM (Ke et al., 2017)
and the DNN-based models:
• LSTM (Hochreiter and Schmidhuber, 1997)
• Multi-Head Attention (Vaswani et al., 2017)
• TabNet (Arik and Pfister, 2021)
The generated tensor programs are benchmarked on two GPU platforms with different architectures:
• NVIDIA 2080Ti GPU with Turing architecture (Jia et al., 2019)
• NVIDIA 3090 GPU with Ampere architecture (Choquette et al., 2021)

Figure 8: Convergence analysis on ResNet-18.
We use float32 as the data type for all evaluations. We train our model with the Adam optimizer for 50 epochs, with a starting learning rate of 7e-4 that decays to 1e-6, and a training batch size of 512. We use TVM v0.8dev in TenSet, LLVM 11.0, and CUDA 11.0 for compilation. Meanwhile, we use XGBoost 1.5.0 and PyTorch 1.7.1 for training the models.
To explore transferable features and fast adaptation of ATFormer between different hardware platforms, ATFormer is pre-trained using offline learning with a number of samples from TenSet and then fine-tuned using online learning on the different platforms. For offline learning, we randomly sample 50, 100, 200, 300, and 500 search tasks from the TenSet NVIDIA Tesla T4 GPU records.
We train 40 models, including XGBoost, LightGBM, LSTM, TabNet, multi-head attention, ATFormer-1L, ATFormer, and ATFormer-Mask, for all experimental evaluations in this paper. Due to the maximum file size limit (100MB) for supplementary material, we release the models pre-trained offline on TenSet-500 for ATFormer-1L, ATFormer, ATFormer-Mask, multi-head attention, and TabNet, as well as all of the pre-trained XGBoost models. We also release running scripts in the supplementary material to reproduce the results in Table 1 of Section 5. More details about the hyperparameters of each cost model in our experiments can be found in Tables 12 through 18.

A.5 Convergence Analysis
In Figure 8, we present the tuning trials versus latency curves that illustrate various stages of auto-tuning with different configurations on ResNet-18. We performed four types of experiments on ResNet under two settings: with and without transfer learning. The blue line indicates ATFormer with transfer learning to expedite the tuning process.

Figure 1: The overview of a search-based framework with computation graph, cost model, and search space.

Figure 2: Hierarchical features of Conv2D with a full tensor program representation in the search space.

Figure 3: The performance model's architecture includes two attention blocks that extract coarse and finegrained features of the tensor program, as well as a lightweight MLP layer for directly predicting the score.
Algorithm 1. Input: search space φ_1, φ_2 with operator σ and setting k. Output: tensor program p* with best configuration c*.

For the DNN-based baselines, we use an LSTM with eight heads and 1024 hidden dimensions, and TabNet as implemented in TenSet. The search algorithm uses the default configurations, and the search terminates when it runs out of allowed measurement trials. We keep the rest of the factors the same for a fair comparison.

Main Results. Figure 6 shows the final optimized total latency results on the RTX 2080Ti GPU. Overall, the ATFormer-series models perform best in all cases. Compared with the tree-based model XGBoost, ATFormer wins in all cases with a 1.15-1.61× speedup. Compared with the DNN-based model TabNet, ATFormer wins in all cases with a 1.14-2.14× speedup. Compared with LSTM, ATFormer performs equally well overall, achieving a 0.96-1.48× speedup. Although LSTM slightly surpasses ATFormer in finding the best configuration on Bert-Tiny and VGG-16, the amount of computation that can be parallelized in ATFormer leads to a shorter tuning time. Overall, the geometric-mean results verify the effectiveness of the attention-based modules over the tree- and DNN-based performance models.

In Table 1, ATFormer achieves the best total latency on the RTX 2080Ti, and it performs almost equally with ATFormer-1L in total latency under a fixed number of measurement trials.

Table 1: Transferable adaptation evaluation between different GPU platforms on ResNet-18.

Table 2: The performance of large-scale Transformer models on TenSet-500 with transfer learning.

Table 3: Pre-trained models on TenSet-500 via transfer learning with converged latency on GPU platforms.

Table 4: Total latency and tuning time of different methods, using the ResNet-18, MobileNet-V2, and Bert-Tiny networks for end-to-end evaluation. The relative gains are obtained for batch size 1 with 300 measurement trials.

Table 5: Different architecture designs for ATFormer.

Table 6: Hierarchical features and performance model architecture improvements for end-to-end evaluation.
Loss Functions. Table 4 shows two different loss functions in our experiments. Method (a) is ATFormer with the root mean square error (RMSE) loss function, while method (b) uses the lambdaRank loss function. Comparing methods (a) and (b), we find that lambdaRank consistently outperforms RMSE in our design across different DNN workloads. This shows that the goal of a decent cost model is to rank the performance of different tensor programs by relative scores in a given search space.
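The ranking objective behind this comparison can be illustrated with a generic pairwise ranking loss. The sketch below (pairwise_rank_loss) is a common pairwise formulation, not necessarily the exact lambdaRank/RankLoss variant used in our experiments:

```python
import math

def pairwise_rank_loss(scores, throughputs):
    """For every pair where program i has higher measured throughput than j,
    penalize the model if it does not score i above j. Averaged over pairs."""
    loss, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if throughputs[i] > throughputs[j]:
                loss += math.log(1.0 + math.exp(scores[j] - scores[i]))
                pairs += 1
    return loss / max(pairs, 1)

# A model that orders the candidates correctly incurs a lower loss than one
# that inverts the order, even though neither predicts absolute throughput.
good = pairwise_rank_loss([2.0, 1.0, 0.0], [0.9, 0.5, 0.1])
bad = pairwise_rank_loss([0.0, 1.0, 2.0], [0.9, 0.5, 0.1])
```

This is why a ranking loss suits the compiler setting: only the relative order of candidates matters when selecting the top-performing tensor programs.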

Table 7: Accuracy of the cost models on TenSet.

Table 8: The training time of the ATFormer-series cost models during offline optimization.

Table 9: Traditional learning with different cost models for batch size 8 on the NVIDIA RTX 3090 GPU.