Generative Table Pre-training Empowers Models for Tabular Prediction

Recently, the topic of table pre-training has attracted considerable research interest. However, how to employ table pre-training to boost the performance of tabular prediction remains an open challenge. In this paper, we propose TapTap, the first attempt that leverages table pre-training to empower models for tabular prediction. After pre-training on a large corpus of real-world tabular data, TapTap can generate high-quality synthetic tables to support various applications on tabular data, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. Extensive experiments on 12 datasets demonstrate that TapTap outperforms a total of 16 baselines in different scenarios. Meanwhile, it can be easily combined with various backbone models, including LightGBM, Multilayer Perceptron (MLP), and Transformer. Moreover, with the aid of table pre-training, models trained using synthetic data generated by TapTap can even compete with models using the original dataset on half of the experimental datasets, marking a milestone in the development of synthetic tabular data generation. The code is available at https://github.com/ZhangTP1996/TapTap.


Introduction
Recently, pre-trained language models (LMs) have attracted a lot of research interest in different domains, especially in the area of natural language processing. After pre-training on a large-scale unstructured text corpus with a self-supervised training objective, e.g., masked language modeling (MLM) proposed by BERT (Devlin et al., 2019), LMs can significantly benefit downstream tasks. Furthermore, recent progress on generative LMs (Radford et al., 2019; Raffel et al., 2020; Lewis et al., 2020) suggests that it is possible to unify different tasks via one LM. The remarkable success
of pre-trained LMs has inspired much research in pre-training over structured tables, one of the most common types of data used in real-world applications (Benjelloun et al., 2020). Different from text, tables usually contain rich and meaningful structural information, and thus LMs on text corpora are not well suited for tabular data. To this end, there has been a growing amount of recent work on table pre-training (Herzig et al., 2020; Yin et al., 2020; Wang et al., 2021b; Liu et al., 2022). However, the vast majority of existing table pre-training works aim to enhance joint reasoning over text and tables (e.g., table question answering, tableQA), while neglecting tabular prediction, an important task in real-world applications. The goal of tabular prediction is to predict a specified target (e.g., the income) based on a set of features (e.g., the age and the occupation). As illustrated in Figure 1, most pre-trained LMs on tables such as TAPAS (Herzig et al., 2020) typically apply MLM variants on crawled tables and text segments to boost their joint reasoning capability in tableQA.
Nevertheless, as of yet, there is little evidence that these table pre-training methods can enhance the performance of tabular prediction tasks. This is probably because tabular prediction tasks are quite challenging. In contrast to the exceptional performance of deep learning in many domains, recent studies (Shwartz-Ziv and Armon, 2022; Gorishniy et al., 2021) question the necessity of deep learning models for tabular prediction, as they are usually outperformed by traditional machine learning models. To summarize, it is still an open challenge to employ table pre-training to boost models for the tabular prediction task.
In this paper, we present TAPTAP (Table Pre-training for Tabular Prediction), the first attempt that leverages pre-training of language models on tables to significantly benefit tabular prediction tasks. To benefit different backbone models, we apply table pre-training from a data perspective, i.e., we utilize TAPTAP to synthesize high-quality examples that can be used to train backbone models. Based on the widely used generative language model GPT (Radford et al., 2019), after continued pre-training on a large-scale corpus of real-world tabular data, TAPTAP is expected to capture a generic tabular data distribution. Then, TAPTAP can be quickly adapted to downstream tables via fine-tuning and can generate high-quality synthetic tables to support various applications on tabular data, including privacy protection, low resource regime, missing value imputation, and imbalanced classification. Meanwhile, such a design decouples the backbone model from the pre-trained model architecture, allowing TAPTAP to benefit different backbone models. Extensive experiments on 12 public datasets demonstrate that generative table pre-training can empower models on tabular prediction in various ways, and that TAPTAP outperforms a total of 16 baselines in different scenarios while supporting three state-of-the-art (SOTA) backbone models. The contributions of this paper can be summarized as follows:

• To our knowledge, we are the first to successfully apply table pre-training of language models to tabular prediction. With carefully designed generation strategies, our method combines the advantages of backbone models for tabular prediction and pre-trained LMs.
• To accomplish the pre-training, we collect and filter 450 public tabular datasets from the Kaggle, UCI, and OpenML platforms, and finally construct a large-scale pre-training corpus.


Related Work

Table Pre-training Previous table pre-training works mainly focus on joint reasoning over text and tables, and can be divided into four lines (Dong et al., 2022): table question answering, which outputs the answer for questions over tables (Yin et al., 2020; Herzig et al., 2020; Yu et al., 2021; Liu et al., 2022; Andrejczuk et al., 2022); table fact verification, which verifies whether a hypothesis holds based on the given table (Eisenschlos et al., 2020); table to text, which generates textual descriptions from the given table (Gong et al., 2020; Xing and Wan, 2021); and table structure understanding, which aims at identifying structural types in the given table (Tang et al., 2021; Wang et al., 2021b; Deng et al., 2021). Our work is different from theirs because we focus on the application of table pre-training to tabular prediction. There are also previous studies that perform tabular pre-training for tabular prediction (Wang and Sun, 2022; Arik and Pfister, 2021; Yoon et al., 2020; Bahri et al., 2022). The major difference between TAPTAP and these studies is that TAPTAP performs cross-table pre-training on language models using a large number of tables, so as to leverage the knowledge embedded in language models, whereas previous works usually perform single-table pre-training (Arik and Pfister, 2021) (or use a few tables with many overlapping columns (Wang and Sun, 2022)) on models specifically designed for tabular data.

Tabular Data Generation Most existing methods synthesize tabular data with generative adversarial networks (Choi et al., 2017; Park et al., 2018; Mottini et al., 2018; Xu et al., 2019; Koivu et al., 2020) or variational autoencoders (Xu et al., 2019; Ma et al., 2020; Darabi and Elor, 2021). However, it is hard for these methods to leverage the textual semantics in tables. More recently, GReaT (Borisov et al., 2022) has successfully applied LMs to generating synthetic tabular data. Nevertheless, there remain significant differences between GReaT and TAPTAP, most notably the table pre-training and the data labeling strategy introduced in this work.


Methodology

Preliminary of Tabular Prediction
A tabular dataset usually contains two parts: the features and the label. Given the features as the input, the goal of tabular prediction is to predict the label. Taking the example from Figure 1, the task is to predict the income (label) of a person based on her/his age, education and occupation (features).
Below we formalize tabular prediction using the binary-classification task, and the formulation can be easily extended to multi-class classification or regression problems. Formally, a tabular dataset with $n$ samples (i.e., rows) and $m$ features (i.e., columns) can be represented by $D = \{(\mathbf{x}_i, y_i)\}_{i=1,\dots,n}$, where $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,m})$ are the feature values and $y_i$ is the label. The $j$-th feature has a feature name $f_j$ (e.g., "age"). A model $F$ takes the features $\mathbf{x}_i$ as input to predict the label $y_i$. Our goal is to train a model such that the test error is as small as possible.
Existing works on improving $F$ either design better model architectures (Gorishniy et al., 2021) or improve the quality of the training data (Zhang et al., 2022). We follow the second path and improve model performance by generating synthetic data. There are four typical scenarios where high-quality synthetic samples are helpful: (1) Privacy protection (Gascón et al., 2017). In many application domains, each party only owns part of the dataset, and several parties may wish to collaboratively train a model on a joint dataset. However, tabular data usually contains sensitive personal information or confidential business secrets that cannot be directly shared with other parties. In this case, TAPTAP can be used to generate synthetic data $D_s$ to replace the real data $D$, while achieving similar model performance. (2) Low resource regime. Data collection can be very expensive in some applications, and hence handling the small data regime is an important challenge. For example, over 44% of the classification datasets on the UCI platform (Asuncion and Newman, 2007) have fewer than 1000 samples. In this case, we can leverage TAPTAP to perform data augmentation in order to boost the backbone model. (3) Missing value imputation. Missing values are ubiquitous in tabular data (Stekhoven and Bühlmann, 2012). In this case, TAPTAP is able to impute the missing values to improve the performance of the model. (4) Imbalanced classification. It is common to have a long-tail label distribution in tabular data (Cao et al., 2019). In this case, TAPTAP can be used to balance the class distribution by conditional sampling (from the minority classes).

Overview
As shown in Figure 2, TAPTAP consists of four steps. (1) Pre-training: train an auto-regressive LM on the table pre-training corpus compiled from many public tabular datasets. (2) Fine-tuning: train the LM on the downstream table. (3) Data Sampling: prompt the fine-tuned LM to sample synthetic tables with only tabular features. (4) Data Labeling: assign pseudo labels to the synthetic tables via downstream backbone models. Below we describe these steps in detail.

Pre-training
Corpus Construction To build the pre-training corpus, we leverage publicly available tabular datasets from the Kaggle (https://www.kaggle.com/datasets), UCI (Asuncion and Newman, 2007), and OpenML (Vanschoren et al., 2013) platforms. We believe table pre-training should be performed on tabular data with rich semantic information; therefore we eliminate datasets with meaningless column names (e.g., V1). After the filtering, we finally collect 450 tabular datasets with a total of nearly 2 million samples. To illustrate the corpus, we show in Figure 3 a word cloud composed of feature names and feature values. Note that we are careful to guarantee that the tabular datasets used in pre-training and the downstream benchmark datasets are non-overlapping, so there is no data leakage issue.

Textual Encoding Table Serialization
Since TAPTAP starts with the GPT model, we follow previous work (Borisov et al., 2022; Liu et al., 2022) and serialize each sample into a sequence of tokens to reduce the difficulty of table pre-training. As suggested by Hegselmann et al. (2022), we take the text template serialization strategy and serialize samples using the "[Feature] is [Value]" template. Taking the example in Figure 2, the first sample in the fine-tuning table is converted into the sentence "Age is 18, Education is HS-grad, Occupation is Machine-op-inspct, Income is ≤ 50K". Formally, given a table $D = \{(\mathbf{x}_i, y_i)\}$, let $x_{i,j}$ be the $j$-th feature value in $\mathbf{x}_i$ and $f_j$ be the $j$-th feature name. The textual encoding transforms the $i$-th sample $\mathbf{x}_i$ into a concatenation of phrases separated by commas, $t_i = (t_{i,1}, ",", t_{i,2}, \dots, ",", t_{i,m})$, where $t_{i,j} = (f_j, \text{"is"}, x_{i,j})$.
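To make the serialization concrete, below is a minimal sketch (not the released code) of the "[Feature] is [Value]" template; the function name and the use of pandas are illustrative assumptions.

```python
import pandas as pd

def encode_row(row: pd.Series) -> str:
    """Serialize one table row into a '[Feature] is [Value]' sentence."""
    parts = [f"{feature} is {value}" for feature, value in row.items()]
    return ", ".join(parts)

df = pd.DataFrame([{"Age": 18, "Education": "HS-grad",
                    "Occupation": "Machine-op-inspct", "Income": "<=50K"}])
print(encode_row(df.iloc[0]))
# -> "Age is 18, Education is HS-grad, Occupation is Machine-op-inspct, Income is <=50K"
```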
Number Encoding Numerical features (e.g., age) are important and widely used in tabular data: over 70% of the features in our pre-training corpus are numerical, yet how to properly encode such features has often been neglected in previous work on tabular prediction. Meanwhile, recent studies on LMs show that they are not good at dealing with numbers (Pi et al., 2022) and suggest that the character-level representation is better suited to capture number semantics than its counterparts (Wallace et al., 2019). Therefore, we use the character-level representation for all numerical features, which means that the phrase "Age is 18" in Figure 2 is converted into "Age is 1 8".
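A hedged sketch of this character-level number encoding; splitting digits with spaces is one straightforward way to realize it, and the helper below is illustrative rather than the exact implementation.

```python
def encode_number(value) -> str:
    """Represent a numerical value at the character level, e.g., 18 -> '1 8'."""
    return " ".join(str(value))

def encode_value(value) -> str:
    # Numerical features are spelled out character by character; other values stay as-is.
    return encode_number(value) if isinstance(value, (int, float)) else str(value)

print(f"Age is {encode_value(18)}")   # -> "Age is 1 8"
```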

Permutation Function
The features in tabular data are not ordered, but they are encoded as an ordered sentence, which introduces spurious positional relationships into the textual encoding. In order to restore the order independence among features, we randomly permute the order of the feature-value pairs each time a sample is encoded. Such permutation enables conditional sampling when doing inference on downstream tables (Borisov et al., 2022), i.e., TAPTAP can generate a synthetic sample conditioned on any set of known features. We take a step further and demonstrate that conditional sampling helps TAPTAP perform well in the missing value imputation scenario.
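As a rough illustration of this feature-order permutation (reusing the serialization idea from the sketches above), each sample can be encoded with a freshly shuffled feature order:

```python
import random
import pandas as pd

def encode_row_permuted(row: pd.Series, rng: random.Random) -> str:
    """Encode a row with a randomly permuted feature order to avoid spurious positional bias."""
    items = list(row.items())
    rng.shuffle(items)
    return ", ".join(f"{feature} is {value}" for feature, value in items)

rng = random.Random(0)
row = pd.Series({"Age": 18, "Education": "HS-grad", "Income": "<=50K"})
print(encode_row_permuted(row, rng))
# e.g. "Income is <=50K, Age is 18, Education is HS-grad"
```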

Pre-training Procedure
As mentioned before, the pre-training follows an auto-regressive manner, i.e., TAPTAP is trained to predict the encoded sentence token by token. Assuming we have $q$ tabular datasets for pre-training, the whole pre-training corpus $T$ can be obtained by taking the union of the encoded samples of all $q$ datasets. In general, TAPTAP factorizes the probability of generating an encoded sample $t = (w_1, \dots, w_{|t|})$ in an auto-regressive manner as $p(t) = \prod_{k=1}^{|t|} p(w_k \mid w_1, \dots, w_{k-1})$. The model is initialized from a pre-trained GPT model (Radford et al., 2019), so that TAPTAP can benefit from the common knowledge already learned by these LMs.

Fine-tuning
Fine-tuning TAPTAP on the downstream table follows a similar procedure as in pre-training. The only difference is that the encoded sentences for fine-tuning are generated by applying textual encoding to the downstream table.
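Below is a minimal, hedged sketch of how the fine-tuning stage (and, analogously, the pre-training stage) could be run with the HuggingFace transformers library on serialized rows; the model name, training arguments, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

# Serialized rows (see the textual-encoding sketch above).
texts = ["Age is 1 8, Education is HS-grad, Income is <=50K"]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="taptap-finetuned", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    # mlm=False gives the causal (auto-regressive) language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```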

Data Sampling
Given the sequence $(w_1, \dots, w_{k-1})$ as the prompt, the fine-tuned TAPTAP is able to output the categorical distribution of the next token $w_k \in V$, where $V$ denotes the vocabulary. In general, $w_k$ is sampled from the conditional probability distribution $p(w_k \mid w_1, \dots, w_{k-1})$. Since we also employ permutation during fine-tuning, the fine-tuned TAPTAP is able to generate synthetic samples given any prompt. We employ three kinds of prompting strategies for different application scenarios (Borisov et al., 2022).
(1) Feature name as prompt. This strategy is used in the privacy protection and low resource regimes, where only feature names in the tabular data are selected as the prompt. The synthetic samples are generated by TAPTAP according to the prompt "[Feature] is". (2) One feature-value pair as prompt. This strategy is used in the imbalanced classification scenario, where the feature names and the minority label(s) are both provided as the prompt. With the label treated as a feature, TAPTAP generates synthetic samples based on the prompt "[Feature] is [Value], ". (3) Multiple feature-value pairs as prompt. This strategy is used in the missing value imputation scenario, where the feature names and the available feature values are provided as the prompt. TAPTAP generates synthetic samples according to a prompt that concatenates the available "[Feature] is [Value]" pairs. The order of the given features in the prompt is random.
Data prompt examples can be found in Figure 2.
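To make the prompting concrete, here is a rough sketch of how the three strategies map to prefixes for generation; it reuses the model and tokenizer from the fine-tuning sketch above, and the prompts and decoding hyperparameters are illustrative assumptions.

```python
import torch

prompts = [
    "Age is",                                                  # (1) feature name as prompt
    "Income is >50K, Age is",                                  # (2) one feature-value pair as prompt
    "Age is 2 8, Education is Some-college, Occupation is",    # (3) multiple feature-value pairs
]

model.eval()
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                                    top_k=50, pad_token_id=tokenizer.eos_token_id)
    # The decoded continuation fills in the remaining feature values of one synthetic row.
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```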

Data Labeling
An accurate label is arguably one of the most crucial ingredients in synthetic samples. Noisy labels can severely degrade the generalization capability of backbone models (Gorishniy et al., 2021). In contrast to previous work relying on LMs to generate labels (Borisov et al., 2022), we propose to assign pseudo labels using the SOTA backbone models, as we argue that LMs are not yet the best models for making predictions on tabular data. Specifically, given the downstream table $D = \{(\mathbf{x}_i, y_i)\}$, we first fine-tune TAPTAP on it and generate the synthetic tabular features $\{\mathbf{x}'_i\}$. Next, a backbone model $F$ is trained to fit the original table $D$. Then, the synthetic labels $y'_i$ can be derived using the well-trained model via $y'_i = F(\mathbf{x}'_i)$. Finally, the synthetic labels and the synthetic tabular features make up the final synthetic table $D_s = \{(\mathbf{x}'_i, y'_i)\}$. The model analysis in the following sections reveals that our design of data labeling (i.e., not using LMs for label generation) is crucial for the superior performance of our approach.
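A minimal sketch of this labeling step with LightGBM as the backbone, assuming the synthetic features generated by TAPTAP have already been parsed back into a dataframe; the function and variable names are illustrative.

```python
import lightgbm as lgb
import pandas as pd

def label_synthetic_table(train_df: pd.DataFrame, synthetic_features: pd.DataFrame,
                          target: str) -> pd.DataFrame:
    """Train the backbone on the original table, then pseudo-label the synthetic features."""
    # Assumes a classification task and numerically encoded categorical columns;
    # a regression task would use LGBMRegressor instead.
    backbone = lgb.LGBMClassifier()
    backbone.fit(train_df.drop(columns=[target]), train_df[target])
    labeled = synthetic_features.copy()
    labeled[target] = backbone.predict(synthetic_features)  # y'_i = F(x'_i)
    return labeled
```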

Experimental Setup
Datasets and Evaluation Metrics We collect 12 diverse real-world datasets from various domains (Asuncion and Newman, 2007; Vanschoren et al., 2013). Each dataset is split into a train set (75%) and a test set (25%), and all experiments share the same splits. We provide some important statistics of each dataset in Table 1 and more details in Appendix A. Following previous works (Grinsztajn et al., 2022; Borisov et al., 2022), we use accuracy and the R2 score as the evaluation metrics for the classification and regression tasks, respectively. For the imbalanced classification scenario, we employ AUC as the evaluation metric. All experimental results are averaged over 10 different random seeds.
Backbone Models To comprehensively evaluate TAPTAP, we experiment with various SOTA backbone models for tabular prediction, including LightGBM (Ke et al., 2017), MLP, and Transformer (Gorishniy et al., 2021). Modern GBDT models (such as LightGBM and XGBoost) have been the most popular models for tabular prediction in the past few years (Shwartz-Ziv and Armon, 2022); we choose LightGBM in our experiments. Recently, MLP and Transformer with piece-wise linear encoding (Gorishniy et al., 2022) have been shown to be competitive with LightGBM.

Main Results
We measure the quality of the synthesized samples by their performance in the application scenarios.
Privacy Protection Following previous work (Borisov et al., 2022), we include the baselines CTGAN (Xu et al., 2019), TVAE (Xu et al., 2019), CopulaGAN (Patki et al., 2016), GReaT-distill and GReaT (Borisov et al., 2022). All methods are used to generate the same amount of synthetic data as the original dataset. The backbone models are trained on the synthetic data and then evaluated on the original test set. The experimental results are presented in Table 2. One can observe that TAPTAP and TAPTAP-distill outperform most of the baseline methods. Noticing that GReaT also utilizes GPT2, the fact that TAPTAP surpasses it by a large margin suggests the superiority of table pre-training. More importantly, with table pre-training, the quality of the synthetic data generated by TAPTAP can even match that of the original data. On half of the privacy protection datasets, LightGBM models trained with our synthetic data achieve almost the same performance as with the original data. This is highly impressive, especially considering that none of the synthetic samples appear in the original dataset.
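For reference, the privacy-protection evaluation protocol boils down to training on the synthetic table only and scoring on the real test split; a hedged sketch with a LightGBM backbone (variable names illustrative):

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import accuracy_score

def evaluate_synthetic(synthetic_df: pd.DataFrame, test_df: pd.DataFrame, target: str) -> float:
    """Train the backbone on synthetic data only and evaluate it on the original test set."""
    model = lgb.LGBMClassifier()
    model.fit(synthetic_df.drop(columns=[target]), synthetic_df[target])
    preds = model.predict(test_df.drop(columns=[target]))
    return accuracy_score(test_df[target], preds)
```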

Low Resource Regime
We perform data augmentation to mitigate the low resource dilemma. The baseline methods are identical to those in privacy protection. During fine-tuning, following the multi-task learning experience of T5 (Raffel et al., 2020), we first use the synthetic data to fine-tune a backbone model. Then, we use the original data to continually fine-tune the model. Experimental results on 9 datasets with fewer than 30k samples are presented in Table 3, which show that TAPTAP is able to perform comparably or better than all baseline methods on most datasets. Furthermore, TAPTAP contributes significant gains on 4 of the 9 datasets, which is highly non-trivial.
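One plausible way to realize this "synthetic first, original second" schedule with LightGBM is training continuation via `init_model`; the sketch below illustrates the idea under that assumption and is not necessarily the exact procedure used in the paper.

```python
import lightgbm as lgb

def two_stage_train(X_syn, y_syn, X_ori, y_ori, num_rounds: int = 200):
    """Fit on the synthetic data first, then continue training on the original (small) data."""
    params = {"objective": "binary", "learning_rate": 0.05, "verbosity": -1}
    booster = lgb.train(params, lgb.Dataset(X_syn, label=y_syn), num_boost_round=num_rounds)
    # Training continuation: the original data refines the booster learned on synthetic data.
    booster = lgb.train(params, lgb.Dataset(X_ori, label=y_ori),
                        num_boost_round=num_rounds, init_model=booster)
    return booster
```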
Overall Summarization First, TAPTAP generally improves the performance of different backbone models in tabular prediction and outperforms the majority of baseline methods across various tabular prediction scenarios. Second, the advantage of TAPTAP over TAPTAP-distill suggests that table pre-training can also benefit from scaling up LMs. Third, TAPTAP is the first method whose synthetic data allows backbone models to achieve performance comparable to that obtained with the original data.

Ablation Study
To investigate the effectiveness of each component in TAPTAP, we conduct an ablation study in which we remove one component at a time (the table pre-training, the data labeling, the column names, and the character-level number encoding) and name the resulting variants accordingly. The experimental results are visualized in Figure 4. We present the average metric values (i.e., Acc. or R2) of each method across the 12 datasets in the privacy protection setting, since it is the most straightforward setting to indicate the quality of synthetic data. We can see that pre-training and data labeling are particularly important for TAPTAP. The semantic information in column names and the character-level representation used to enhance number encoding also provide considerable improvement.

Analysis
The Scale of Pre-training Corpus Figure 5 illustrates the influence of the pre-training scale on the downstream performance. We present the results with 0.02, 0.1, 0.5 and 2 million samples. As one can observe, scaling up the pre-training corpus brings positive effects. However, the number of high-quality real-world tabular datasets is limited. Therefore, it may be helpful to take advantage of the millions of tables available on the Web.
Pre-training using Web Tables To explore the above direction, we present a preliminary study on using tables from the Web for pre-training. We parse over 130k Web tables with a total of 8 million samples from the WikiTables corpus (Bhagavatula et al., 2015). We use these Web tables together with the tabular datasets for pre-training. The results in the privacy protection setting are presented in Table 6. We can see that even with a large number of Web tables, it is still hard to further boost the backbone models. We attribute this to the quality issue: the collected tabular datasets have already been examined by the platforms and usually have higher quality than noisy Web tables. How to automatically identify high-quality tables from the huge number of Web tables for pre-training is a promising future direction.

Conclusion & Future Work
In this paper, we propose TAPTAP, a table pre-training method to empower models for tabular prediction. It can be combined with various backbone models and boosts them via synthesizing high-quality tabular data. A large-scale empirical study demonstrates that TAPTAP can benefit different SOTA backbone models on four tabular prediction scenarios. In the future, we plan to extend TAPTAP to process tables with a large number of features.

Acknowledgement
We thank all the anonymous reviewers for their constructive feedback and insightful comments. Tianping Zhang, Shaowen Wang, and Jian Li are supported in part by the National Natural Science Foundation of China Grant 62161146004.

Limitations
The major limitation of TAPTAP is scalability. While we enjoy the advantages of LMs, we also inherit their drawbacks. In practice, TAPTAP usually requires more running time and GPU memory than other methods. A detailed comparison can be found in Appendix B.3. In addition, TAPTAP can only process tabular data with fewer than 100 features due to the input length limit of GPT (i.e., 1024 tokens).

Ethics Statement
In this paper, we collected and filtered 450 publicly available tabular datasets to construct the pre-training corpus for TAPTAP. As these datasets have been reviewed by well-known machine learning platforms such as Kaggle, they should contain no private information about individuals. However, we cannot confirm whether these datasets contain potential biases, since the corpus contains millions of samples. For example, there may be tables that could wrongly associate recruitment results with gender. Also, since our model is pre-trained based on GPT, readers may be concerned that the synthetic tables generated by our model contain offensive content. On this point, we argue that the concern is limited: for categorical features, our model can be easily constrained to generate only values that appear in the downstream table, which is relatively controllable.

A Datasets
We provide the URLs of the public datasets in Table 7. These datasets are publicly available, and their licenses permit usage for research purposes.

B Additional Experiments
B.1 Distance to Closest Record

To compute the distance to the closest record (DCR), we use the $L_1$ distance for numerical features. For categorical features, we set the distance to 0 for equal categories and 1 otherwise. We present the results of California Housing and HELOC in Figures 7 and 8.
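A small sketch of how the DCR with this mixed distance can be computed; the column handling and names are illustrative.

```python
import numpy as np
import pandas as pd

def dcr(synthetic: pd.DataFrame, train: pd.DataFrame, num_cols, cat_cols) -> np.ndarray:
    """Distance of each synthetic record to its closest training record:
    L1 on numerical columns plus a 0/1 mismatch penalty on categorical columns."""
    distances = []
    for _, row in synthetic.iterrows():
        num_dist = np.abs(train[num_cols].to_numpy()
                          - row[num_cols].to_numpy(dtype=float)).sum(axis=1)
        cat_dist = (train[cat_cols].to_numpy() != row[cat_cols].to_numpy()).sum(axis=1)
        distances.append(float((num_dist + cat_dist).min()))
    return np.array(distances)
```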

B.2 Sampling Diversity
We employ the coverage score (Naeem et al., 2020) to quantitatively evaluate the sampling diversity of TAPTAP and the baseline methods. Coverage refers to the proportion of real records whose manifold contains at least one synthetic record. A manifold is defined as a sphere surrounding the sample, with a radius r determined by the distance between the sample and its k-th nearest neighbor. We present the averaged coverage scores in Figure 6.
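A hedged sketch of the coverage computation following that definition (k-NN radius over real samples, checked against synthetic samples); scikit-learn's NearestNeighbors is used here purely for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def coverage(real: np.ndarray, synthetic: np.ndarray, k: int = 5) -> float:
    """Fraction of real records whose k-NN sphere (among real records) contains a synthetic record."""
    # Radius of each real sample's manifold: distance to its k-th nearest real neighbor
    # (k + 1 because each point is returned as its own nearest neighbor).
    radii = NearestNeighbors(n_neighbors=k + 1).fit(real).kneighbors(real)[0][:, -1]
    # Distance from each real sample to its nearest synthetic sample.
    nearest_syn = NearestNeighbors(n_neighbors=1).fit(synthetic).kneighbors(real)[0][:, 0]
    return float((nearest_syn <= radii).mean())
```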

B.3 Running Time Comparison
We analyze the running time of TAPTAP, TAPTAP-distill, and the baseline methods. The experiments are carried out on a single NVIDIA GeForce RTX 3090 with 24 GB of GPU memory, 64 GB of system RAM, and an Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz with 16 cores. For the privacy protection setting, we report the running time of training/fine-tuning and sampling separately. We present the results on the Adult Income dataset in Table 8. For the missing value imputation setting, we present the running time on the California Housing dataset in Table 9.
We can see that TAPTAP and TAPTAP-distill require more running time than most of the baseline methods. While we enjoy the benefits of leveraging LMs to achieve top performance, we also inherit the drawback of LMs requiring more computational resources. However, there are important real-world applications, such as healthcare or finance, where achieving better performance outweighs saving computational time. In addition, the fine-tuning and sampling time can be reduced by using more computational resources.

B.4 Privacy Protection
Tables 2, 13 and 14 show the performance of our method and baseline methods in the privacy protection setting with LightGBM, MLP, and Transformer as the backbone, respectively.

B.5 Low Resource Regime
Tables 15 and 16 show the performance of our method and baseline methods in the low resource regime setting with MLP and Transformer as the backbone, respectively. Note that both low resource datasets and high resource datasets are presented in the tables.

B.6 Missing Value Imputation
Tables 17, 4 and 18 show the performance of our method and baseline methods in the missing value imputation setting using the MCAR mechanism with LightGBM, MLP, and Transformer as the backbone, respectively. Tables 19, 20 and 21 show the performance of our method and baseline methods in the missing value imputation setting using the MAR mechanism with LightGBM, MLP, and Transformer as the backbone, respectively. MIWAE and HyperImpute fail on some datasets because one feature in the dataset contains too many missing values. For example, 96.9% of the data points in the "weight" column of the Diabetes dataset are missing. However, these methods require at least one valid value for each training batch.

B.7 Imbalance Classification
Table 5 shows the performance of our method and baseline methods in the imbalanced classification setting with LightGBM as the backbone. SMOTE-based methods fail on the Loan dataset because there are fewer than 10 minority-class samples, which results in the number of sampled data points being smaller than the number of neighbors (Chawla et al., 2002) required.

C Hyperparameters Optimization
We use optuna (Akiba et al., 2019) to tune the hyperparameters of our backbone models, i.e., LightGBM, MLP, and Transformer. For each specific dataset and model, we first use the original data to tune the hyperparameters of the model. Then this set of hyperparameters is used throughout all the experiments of the dataset on all the methods for a fair comparison.

C.1 LightGBM

We fix n_estimators = 1000. Other hyperparameters and the search space for tuning are in Table 10.
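A minimal sketch of the optuna tuning loop for LightGBM under the fixed n_estimators = 1000; the exact search space is in Table 10, and the parameter ranges below are illustrative assumptions.

```python
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

def tune_lightgbm(X, y, n_trials: int = 50) -> dict:
    """Tune LightGBM hyperparameters with optuna; n_estimators is fixed to 1000."""
    def objective(trial):
        params = {
            "n_estimators": 1000,
            "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "num_leaves": trial.suggest_int("num_leaves", 15, 255),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        }
        model = lgb.LGBMClassifier(**params)
        return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```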

C.2 MLP
We follow the implementation in Gorishniy et al. (2022). We present the hyperparameter search space in Table 11.

Figure 3: The word cloud for the pre-training corpus.

Figure 4: Experimental results of the ablation study. The y-axis is the average metric value across all datasets in the privacy protection setting with LightGBM.

Figure 5: The influence of pre-training scale on the downstream performance. The value of each method is the average metric value across all datasets in the privacy protection setting with LightGBM.

Figure 7: Distance to closest record (DCR) distribution of the California Housing dataset. "Original" denotes the DCR of the original test set with respect to the original train set. The experimental results illustrate that each method does not copy samples from the train set.

Figure 8: Distance to closest record (DCR) distribution of the HELOC dataset.


Figure 2: The illustration of our method. The TAPTAP model is first pre-trained on the pre-training corpus, and then fine-tuned on the downstream table. During both pre-training and fine-tuning, tables are serialized into sequences via textual encoding, and TAPTAP is trained to predict them token by token. During inference, TAPTAP is prompted to sample values for "___" in data prompts (covering the privacy protection, low resource regime, missing value imputation, and imbalanced classification scenarios), and the filled values build up a synthetic table. Finally, once the backbone model has yielded labels for the synthetic table, it can be used to strengthen the backbone model.

Table 1: Properties of benchmark datasets.

Table 2: The experimental results in privacy protection. We present the difference in metrics between the model trained on the synthetic data and the one trained on the original data; the lower, the better. A gap close to zero suggests that the synthetic data is of comparable quality to the original data. Below, the backbone model is LightGBM. Results of MLP and Transformer can be found in Tables 13 and 14.

Table 3: The experimental results in low resource regime. "+Ori" means training with the original data. "+Ori+Synthetic Data" means training with the original data plus the synthetic data. Below, the backbone model is Transformer with piece-wise linear encoding. The full results on all datasets can be found in Tables 15 and 16.


Table 4: The experimental results in missing value imputation. "+M-Ori" means training with the original data processed by the MCAR mechanism. "+M-Ori+Synthetic Data" means training with the M-Ori data where the missing values are imputed by different models. Below, the backbone model is MLP. Results using LightGBM and Transformer as backbone models can be found in Tables 17 and 18. Results with the MAR mechanism can be found in Appendix B.6. ✗ denotes the method cannot run successfully on the dataset due to too many missing values.

Table 5: Experimental results in imbalanced classification. "I-Ori" is the imbalanced data. Below, the backbone model is LightGBM. ✗ denotes the method cannot run successfully on the dataset due to too few samples in the minority class. The metric is AUC.

Table 6: The comparison between TAPTAP and TAPTAP with additional web tables for pre-training.

Table 16: The experimental results in low resource regime. "+Ori" means training with the original data. "+Ori+Synthetic Data" means training with the original data plus the synthetic data. Below, the backbone model is Transformer with piece-wise linear encoding.