TabPrompt: Graph-based Pre-training and Prompting for Few-shot Table Understanding



Introduction
Abundant tabular data resources are readily available nowadays (Cafarella et al., 2008). However, the extensive knowledge within these corpora remains largely untapped because most tables are designed to be "human-friendly" rather than "machine-friendly" (Dong et al., 2019b). Consequently, to extract knowledge stored within these vast tables, it is crucial to enable machines to comprehend the semantics of tabular data, a problem referred to as Table Understanding (TU). Furthermore, TU is fundamental to numerous downstream tasks, including Knowledge Base Augmentation (Bhagavatula et al., 2015) and Table QA (Zhang and Balog, 2020).
Early methods involved manual feature construction designed for specific table datasets (Chen and Cafarella, 2014) or utilized CNN or LSTM architectures (Nishida et al., 2017). These methods demonstrate good performance only when an ample amount of manually labeled training data is available. However, labeling tabular data incurs high labor costs due to the inherent characteristics of tables, such as their two-dimensionality and flexible layouts. Consequently, these methods struggle to cope with the scarcity of labeled tabular data.
More recently, researchers have explored the application of pre-trained language models to the task of TU (Wang et al., 2020b; Deng et al., 2020; Herzig et al., 2020). During the pre-training phase, the model learns good initialization parameters through self-supervised training on large amounts of unlabeled tabular data, enabling it to achieve excellent performance after fine-tuning on downstream tasks. Although self-supervised pre-training reduces the burden of manual annotation to some extent, it still requires a substantial amount of labeled table data during the fine-tuning process. Thus, the existing "pre-training, fine-tuning" methods remain insufficient to address the scarcity of labeled tabular data when performing TU tasks. In other words, existing methods struggle to perform well in few-shot TU.
On the other hand, tables are structured data that encompass not only textual content but also topological information related to their layout. However, existing methods sometimes misunderstand tabular semantics because they disregard the topological information within tables. For example, many pre-training models (Herzig et al., 2020) flatten a table and concatenate its contents row by row into a one-dimensional plain text as the input of the encoder. These methods, focusing only on textual content, overlook the topological information of the table, even though cells in the same column typically have semantic commonalities. As depicted in Fig. 1 (a), the cells in the second column belong to the language category.
To address the above two challenges, we propose a new framework called TabPrompt. First, we incorporate soft prompts (Li and Liang, 2021) into the framework and devise a prompt-based learning method suitable for tabular data. Recently, the new paradigm of "prompt-based learning" has attracted extensive attention due to its remarkable performance in few-shot scenarios (Liu et al., 2021a). Prompt-based learning enables more effective knowledge transfer by making downstream TU tasks more compatible with the pre-training task. By leveraging it, TabPrompt is capable of effectively tackling the challenges posed by few-shot TU. Second, we employ a novel Graph Contrastive Learning (Graph CL) method to encode tabular data during pre-training. GNNs are well-suited for processing data with topology (Kipf and Welling, 2016), making them an ideal choice for capturing the topological information of tabular data. In addition, for encoding intrinsic structure knowledge, CL is one of the most effective and popular pretext tasks (Zhang et al., 2020). In this regard, TabPrompt introduces a graph construction method that fully considers the internal relationships between cells, which enables TabPrompt to better learn the topological structure of the table during pre-training. To evaluate TabPrompt, we conduct extensive experiments in few-shot scenarios on three public datasets, comparing TabPrompt against several strong baselines. TabPrompt outperforms all the baselines, demonstrating its effectiveness in few-shot scenarios.
The contributions of this paper are summarized as follows:
• We apply prompt-based learning to the TU task to tackle the scarcity of labeled tabular data. To the best of our knowledge, this is the first attempt to introduce prompt learning into the field of TU.
• To obtain vector representations that incorporate the topological information of tables, we introduce a novel Graph CL method as the pretext task to pre-train the GNN encoder.
• To evaluate TabPrompt, we conduct experiments on publicly available datasets, focusing on two specific few-shot TU sub-tasks. In both tasks, TabPrompt outperforms all baselines, underscoring its superiority in handling few-shot TU scenarios.

RELATED WORKS
Graph Neural Network (GNN). GNN architectures, such as GCN (Kipf and Welling, 2016) and GIN (Xu et al., 2018), have garnered substantial interest among researchers for their remarkable capability to handle real-world data containing inherent topological structure. Recognizing that tables inherently embody topological information, several studies (Du et al., 2021; Wang et al., 2021) have explored the application of GNNs in table-related research fields. However, their methods of constructing the tabular graph fail to fully consider the topological relationships between different cell types, resulting in misunderstandings of tabular semantics.
Pre-training and Fine-tuning. Since the introduction of BERT, the "pre-training and fine-tuning" paradigm has gained significant attention. Researchers have adapted this paradigm to TU by designing customized pretext tasks tailored to tabular data (Wang et al., 2020b; Iida et al., 2021; Deng et al., 2020). While these methods effectively leverage unlabeled data during pre-training, they still rely on a substantial amount of labeled data in the fine-tuning stage to achieve optimal performance. As a result, these methods struggle to handle few-shot TU scenarios where only a limited amount of labeled data is available. Additionally, these methods often flatten the table into a sequential input, disregarding the inherent topology of the table. This can lead to the loss of crucial topological information during the modeling process.
Other Deep Learning-based TU. In the early stages of table understanding (TU) research, the emphasis was primarily on Cell Entity Linking (Ibrahim et al., 2016; Hassanzadeh et al., 2015; Efthymiou et al., 2017; Bhagavatula et al., 2015). These methods often relied on pre-defined ontologies and external knowledge bases, which limited their versatility. More recent work has shifted towards addressing broader TU tasks such as Table Cell Classification (Ghasemi-Gol et al., 2019; Sun et al., 2021) and Table Type Classification (Eberius et al., 2015; Nishida et al., 2017). These methods typically utilize manual features or early neural architectures like LSTM, which need a large quantity of manually labeled data for training.

Prompt-based Learning. The main idea of the new paradigm "pre-train, prompt, and predict" (Liu et al., 2021a) is to reformulate the target task to look more like the pretext task, so as to better use what the model has already learned. An increasing number of novel prompt-based learning methods have been proposed, encompassing various forms of prompts, such as soft prompts (Li and Liang, 2021). However, it is worth noting that there is currently no research on prompts specifically designed for tabular data.

PROPOSED METHOD
In this section, we first introduce the preliminaries of the TU sub-tasks relevant to our work and then describe TabPrompt in detail.

Preliminaries
Given a table T with N rows and M columns, let c_{i,j} denote the cell located in the i-th row and j-th column.
Cell Type Classification (CTC). This sub-task of TU involves identifying the cell type of each cell c_{i,j} in a table T. CTC has been widely studied, with different taxonomies of cell types used across works (Wang et al., 2020b; Du et al., 2021; Sun et al., 2021). In our work, we adopt the taxonomy used in Dong et al. (2019a) and expand it by including an additional type for a comprehensive comparison, as shown in Fig. 1. The four cell types are defined as follows: a value represents a basic unit describing the content within a table; a valueName serves as a summary unit of value cells in the same column; an index is utilized to index value cells; an indexName acts as a summary unit for index cells in the same column.

An Overview of TabPrompt
An overview of our framework is shown in Fig. 3. First, TabPrompt transforms the tabular data into graph data, taking into account the topological relationships between cells during graph construction. Second, TabPrompt utilizes the tabular Graph CL as the pretext task to pre-train the GNN encoder. Lastly, TabPrompt is trained on a limited amount of labeled data to tune the soft prompts. This is accomplished by reformulating the objectives of CTC and TTC, aiming to bridge the gap with the pre-training objective.

Construct the Tabular Graph
Existing methods that utilize GNNs often consider adjacent table cells or cells with similar strings as neighboring nodes in the graph (Du et al., 2021; Wang et al., 2021). However, these methods fail to adequately capture the internal topological relationships between nodes of different types. To address this limitation, we adopt a more sophisticated method of constructing graph data: we establish connections between cells that exhibit specific topological relationships. This transformation method aligns with the fundamental assumption of our tabular Graph CL method, which posits that connected nodes have closer relationships and should display higher similarity in their vector representations.
According to our observations of tabular data, we identify several characteristics of table cells: 1) Cell pairs within the same column exhibit higher levels of relatedness than cell pairs across different columns. 2) Cells within the header row of a table tend to have similar levels of dependencies. 3) Each individual cell within a merged cell has the same type. 4) The string within a header cell often serves as a summary description of the non-header cells in the same column. Based on these patterns, we establish links between cells in the following situations: 1) adjacent cells within the same column; 2) adjacent cells within the header row; 3) cells split from merged cells; 4) cells within the header row and non-header cells within the same column.
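The four linking rules above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(row, col)` cell addressing, the `header_rows` parameter, and the `merged_groups` argument (groups of cells split from one merged cell) are our own simplifications.

```python
def build_edges(n_rows, n_cols, header_rows=1, merged_groups=()):
    """Return an undirected edge set over cells addressed as (row, col)."""
    edges = set()

    def link(a, b):
        # Store each undirected edge once, with endpoints in sorted order.
        edges.add((min(a, b), max(a, b)))

    # Rule 1: adjacent cells within the same column.
    for j in range(n_cols):
        for i in range(n_rows - 1):
            link((i, j), (i + 1, j))
    # Rule 2: adjacent cells within the header row(s).
    for i in range(header_rows):
        for j in range(n_cols - 1):
            link((i, j), (i, j + 1))
    # Rule 3: cells split from the same merged cell are fully connected.
    for group in merged_groups:
        cells = sorted(group)
        for a in cells:
            for b in cells:
                if a < b:
                    link(a, b)
    # Rule 4: header cells linked to every non-header cell in the same column.
    for j in range(n_cols):
        for i in range(header_rows):
            for k in range(header_rows, n_rows):
                link((i, j), (k, j))
    return edges
```

Note that, following the rules, two non-header cells in the same row (e.g., (1, 0) and (1, 1)) are deliberately not linked: row-wise adjacency alone does not imply the semantic commonality that column membership does.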

Tabular Graph Pre-training
In this section, we first introduce our tabular Graph CL employed for pre-training, followed by a detailed discussion of the pre-training process.

Tabular Graph Contrastive Learning. After tables are transformed into graphs, our Graph CL tailored to tabular data can be carried out. The process of our tabular Graph CL method is shown in Fig. 4. A tabular graph of table T can be defined as T_g = (V, E), where each node corresponds to a cell c_ij with an initial feature x_ij; x_ij consists of the output of the string in cell c_ij through BERT (Devlin et al., 2019) and some handcrafted features that are general on tabular data. We observe that the information carried by surrounding cells may be helpful in determining the identity of the central cell. For example, if a cell is surrounded by numeric cells, the cell is also likely to be numeric. We define B_ij as the embedding of cell c_ij combined with its neighbors' information. The formula of B_ij is as follows:

B_ij = GNN(x_ij, {x_uv | (c_ij, c_uv) ∈ E}),  B_g = Readout({B_ij}),

where GNN denotes the encoder. Note that B_g is the vector representation of the whole tabular graph T_g. The choice of readout function is flexible, such as sum pooling, mean pooling, or concatenation. Our tabular Graph CL method follows the assumption that the distance between connected cell pairs should be smaller than that between unconnected cell pairs in the embedding space. Formally, given a sample triplet of cells (c_ab, c_pos, c_neg) such that (c_ab, c_pos) ∈ E and (c_ab, c_neg) ∉ E, the vector distance between the former cell pair should be smaller than that of the latter:

dis(B_ab, B_pos) < dis(B_ab, B_neg),

where dis is the reciprocal of the cosine distance function.
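As a toy illustration of how a cell embedding can absorb neighbor information, the following sketch performs a single step of mean aggregation. The actual encoder in the paper is a multi-layer GNN; this stand-in function and its name are our own simplification.

```python
import numpy as np

def cell_embedding(x, neighbor_feats):
    """One-layer mean aggregation: average the cell's own feature vector
    with the mean of its graph neighbors' feature vectors.
    A toy stand-in for a GNN encoder layer."""
    x = np.asarray(x, dtype=float)
    if not neighbor_feats:
        return x
    neighbor_mean = np.mean(np.asarray(neighbor_feats, dtype=float), axis=0)
    return (x + neighbor_mean) / 2.0
```

This matches the intuition in the text: a cell whose neighbors are numeric-looking is pulled toward a numeric-looking representation.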
Pre-Training. The strength of pre-trained models lies in their ability to leverage a vast amount of unlabeled data. Similarly, in our tabular Graph CL, we capitalize on its label-free nature by employing it as the pretext task for model pre-training. During this phase, the model learns to assign similar vector representations to two highly related nodes. After pre-training, we can determine whether there is a high degree of correlation between two cells by measuring the vector distance between them. Before initiating the pre-training process, it is necessary to prepare the training data.
The training data comprises triplets that consist of nodes along with their corresponding positive and negative nodes. Formally, given a tabular graph T_g, we sample a pair of positive and negative cells for every cell in T_g to form a triplet (c_ij, c_pos, c_neg) for CL. To construct a sample set S for pre-training, we sample triplets from every graph T_g in the tabular graph set TG. We employ a GNN as the pre-trained encoder with parameters W_i. The objective function of pre-training is as follows:

L = - Σ_{(c_ij, c_pos, c_neg) ∈ S} ln [ exp(sim(B_ij, B_pos)/τ) / ( exp(sim(B_ij, B_pos)/τ) + exp(sim(B_ij, B_neg)/τ) ) ],

where sim denotes cosine similarity and τ is a temperature hyperparameter often used in CL to adjust the distribution shape.
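The triplet objective can be sketched for a single sample as follows. This is our reconstruction under the stated assumptions (cosine similarity, a softmax of the positive similarity against the single negative, both scaled by τ); the paper's exact loss form may differ.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_cl_loss(anchor, pos, neg, tau=0.5):
    """-log softmax of the positive similarity against the negative one,
    with both similarities scaled by the temperature tau."""
    s_pos = np.exp(cos_sim(anchor, pos) / tau)
    s_neg = np.exp(cos_sim(anchor, neg) / tau)
    return float(-np.log(s_pos / (s_pos + s_neg)))
```

When the anchor is much closer to the positive than to the negative, the loss approaches 0; when the two similarities are equal, it equals ln 2.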
When pre-training ends, the encoder GNN is utilized to handle downstream TU tasks with the pre-trained parameters W i .

Tabular Graph Prompting
Prompt-based learning aims to bridge the significant gap between pre-training and downstream tasks, which allows the model to transfer the knowledge acquired during pre-training more effectively.
In this section, we introduce the process of reformulating two TU tasks to enhance their compatibility with the pre-training objective, as well as the details of tuning the soft prompts. Fig. 5 shows the prompting process for CTC.

Prompt Addition. In the prompting stage, a key step is to reformulate the raw input by designing an appropriate prompt addition (Liu et al., 2021a). For instance, in machine translation, the input "I love you." is reformulated as "English: I love you. French: [Z]" with [Z] denoting an answer slot. Nevertheless, applying such templates to tabular data is challenging due to the interdependence of text within adjacent cells. Therefore, we develop a customized prompt addition specifically designed for tabular data.
As mentioned above, our pre-trained model is inclined to give closer vector representations to highly related cells when presented with a table as input. Consequently, the machine can determine whether two nodes share the same type based on the distance between their embeddings. Following this idea, we create proxy cells for each cell type, with their vector representations calculated as the mean of the vector representations of nodes of the same type in the few-shot training set. For each cell in a table, the machine compares its vector representation with those of all proxy nodes to determine its type. A similar prompt addition is also employed for the TTC task: the added proxy node represents the type of the table, and its vector is calculated as the mean of B_Tg over tables of the same type.

Prompt Tuning. In the prompt-based training process, there is a potential risk of overfitting in few-shot experimental settings if all parameters are tuned (Dong et al., 2021). To mitigate this, we freeze the pre-trained parameters W_i and introduce the soft prompt (Fang et al., 2022). This involves employing the learnable prompts p_c and p_t for the CTC and TTC tasks, respectively. We perform element-wise multiplication between the vector of each node and the corresponding soft prompt. This operation can be seen as assigning weights to each element of the vector. By tuning the learnable soft prompts, the model can assign larger weights to elements that are more relevant to the specific tasks, thereby enhancing its performance.
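The proxy-cell idea can be sketched as follows. This is only our illustration: the function names, applying the soft prompt to both the query and the proxies, and using cosine similarity as the comparison are assumptions, not the authors' implementation.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def make_proxies(embeddings, labels):
    """Proxy vector per type = mean of the few-shot embeddings of that type."""
    return {y: np.mean([e for e, l in zip(embeddings, labels) if l == y], axis=0)
            for y in set(labels)}

def classify_cell(cell_emb, proxies, soft_prompt):
    """Re-weight feature dimensions by element-wise multiplication with the
    soft prompt, then predict the type whose proxy is most similar."""
    q = np.asarray(cell_emb, dtype=float) * soft_prompt
    return max(proxies, key=lambda y: cos_sim(q, proxies[y] * soft_prompt))
```

With `soft_prompt` fixed to all ones, this reduces to plain nearest-proxy classification; tuning the prompt amounts to learning which dimensions should dominate the comparison.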
Formally, given the CTC dataset D^C_t containing a labeled set of tabular graphs T'_g = (V, E, Y^C), where Y^C denotes the set of cell-type labels, the formula for training p_c is as follows:

L(p_c) = - Σ_{c_ij} ln [ exp(sim(p_c ⊙ B_ij, B_{y^C_u})/τ) / Σ_{y^C_v ∈ Y^C} exp(sim(p_c ⊙ B_ij, B_{y^C_v})/τ) ],

where B_{y^C_u} is the vector of the proxy node representing the cell type y^C_u labeling cell c_ij. Similarly, the TTC dataset D^T_t contains a set of labeled tabular graphs T'_g = (V, E, y^T_g), where y^T_g is the label of the table type. The label set of our TTC task is denoted as Y^T = {y^T_0, ..., y^T_4}. The formula for training p_t is as follows:

L(p_t) = - Σ_{T_g} ln [ exp(sim(p_t ⊙ B_Tg, B_{y^T_u})/τ) / Σ_{y^T_v ∈ Y^T} exp(sim(p_t ⊙ B_Tg, B_{y^T_v})/τ) ],

where B_{y^T_u} is the vector of the proxy node representing the table type y^T_u.

EXPERIMENTS
In this section, we conduct experiments on public datasets to evaluate the performance of TabPrompt in both the CTC and TTC tasks.

Datasets
Although our method can be applied to tables from other documents (CSV sheets, PDF documents, etc.), we focus on web tabular data in this paper since they are easily accessible and easy to parse. We employ four datasets in our work: TabEL (Bhagavatula et al., 2015) for pre-training, TURL and WebSheet for CTC, and WCC for TTC. The taxonomies adopted in these datasets differ slightly from the taxonomy used in this paper. We comprehensively describe the correspondence between the various taxonomies and other details in Appendix A.

Baselines
We compare TabPrompt with strong baselines for CTC and TTC to verify the effectiveness of TabPrompt.
We employ the following baselines for CTC. TCC-Embd (Ghasemi-Gol et al., 2019) is a tabular cell classification method with pre-trained CBOW & Skip-gram cell embeddings (Mikolov et al., 2013). PSL (Sun et al., 2021) reformulates the CTC task as block detection, which aims to detect the data blocks in a table. TabularNet (Du et al., 2021) utilizes a homogeneous graph constructed from the WordNet (Fellbaum, 2000) knowledge base and adopts a GCN and an LSTM as the encoder. TableFormer (Yang et al., 2022) is guaranteed to be invariant to row and column order perturbations. Because it is designed specifically for Table QA and Table Verification, it cannot handle TU directly. To adapt it for TU, we treat such models as encoders that output embeddings of table cells and stack downstream classifier layers (including an MLP layer and a pooling layer) on top of the encoders so that they can be compared with our method. FewTPT (Liu et al., 2022) is a Chinese tabular language model; thus, we pre-train it from scratch using 570k English Wiki tables. While the "Table Classification" task in FewTPT centres on identifying header domains within tables, distinct from the focus of TTC, we employ a similar processing method as in the case of TableFormer to facilitate handling the TU tasks addressed in this paper. TUTA (Wang et al., 2020b) is a pre-training model with tree-based transformers for TU, which achieves SOTA performance among existing methods.
We employ the following baselines for TTC. DWTC (Eberius et al., 2015) trains a Random Forest model on manually engineered features. TabNet (Nishida et al., 2017) utilizes a hybrid deep neural network architecture of LSTM and CNN. TabVec (Ghasemi-Gol and Szekely, 2018) is an unsupervised method that embeds tables based on table-level manual features. Additionally, for TTC, TableFormer and FewTPT can also be assessed by applying stacked downstream classifiers. In addition to CTC, TUTA (Wang et al., 2020b) also performs well on TTC.

Settings and parameters
The embedding of a node in a tabular graph is composed of semantic information and manual features. The embedding of semantic information is obtained by feeding the string of the node into Sentence-BERT (Reimers and Gurevych, 2019) and then performing dimensionality reduction. The manual features used in our work are introduced in Crestan and Pantel (2011) and have strong versatility on tabular data. We present details of the manual features in Appendix B. For fairness, the manual features used in TabPrompt are also provided to all baselines. We employ the macro-F1 score and the standard deviation as evaluation metrics in our experiments, as is common in other TU works.
In the pre-training phase, we employ a 3-layer GIN as the backbone and set the hidden dimension to 64. We randomly sample 5k tables from TabEL and utilize the tabular Graph CL method described above as the pretext task to pre-train the parameters of the backbone. Details of the hyper-parameters can be found in Appendix C.
Following a typical k-shot classification setting (Liu et al., 2021b; Zhou et al., 2019; Wang et al., 2020a), we generate a series of few-shot downstream tasks of CTC and TTC for model training, validation, and testing.
For CTC, we conduct this downstream task on two datasets, i.e., TURL and WebSheet. On each dataset, we generate ten z-shot Cell Type Classification tasks for training and validation. In each task, z tables are randomly sampled for both training and validation, where z ∈ {1, 3}. Additionally, we randomly sample 20 tables for testing from the remaining tables not used for training and validation. After conducting the experiments, we report the mean macro-F1 score of the ten testing results, along with the standard deviation. For TTC, following a similar process, we randomly generate ten z-shot (z ∈ {3, 5}) table type classification tasks (i.e., sampling z tables per class in each task) from WCC and test on the remaining tables.
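The z-shot task generation described above can be sketched roughly as follows. This is our simplification; the paper's actual sampling may differ, e.g., in how classes are balanced across the sampled tables.

```python
import random

def make_episode(table_ids, z, n_test=20, seed=0):
    """Sample z tables for training, z for validation, and n_test test
    tables from the remainder, mirroring the z-shot setup described above."""
    rng = random.Random(seed)
    pool = list(table_ids)
    rng.shuffle(pool)
    train, val = pool[:z], pool[z:2 * z]
    test = rng.sample(pool[2 * z:], n_test)
    return train, val, test
```

Repeating this with ten different seeds yields the ten tasks over which the mean macro-F1 and standard deviation are computed.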

Performance Evaluation
In this section, we analyze the results of the CTC and TTC experiments conducted on public datasets, respectively.

CTC. Based on the results of the few-shot CTC presented in Table 1, we draw the following conclusions. Firstly, the superior performance of TabPrompt over all baselines underscores the effectiveness of our proposed framework. This result shows that prompt-based learning is well-suited to addressing few-shot TU scenarios. Secondly, despite TUTA having a higher number of parameters pre-trained on more data, TabPrompt achieves better performance. This finding highlights the significance of bridging the gap between pre-training and downstream tasks, as it enhances the effectiveness of knowledge transfer. Thirdly, TabPrompt outperforms TabularNet, indicating that considering the topological information within the table significantly aids the model in comprehending the table's semantics. Fourthly, TableFormer and FewTPT exhibit inferior performance compared to TUTA despite all being pre-trained. This is mainly because neither is designed for the TU task, leading to the absence of modules tailored to TU tasks.

TTC. We present the results of few-shot TTC in Table 2. First, the consistently superior performance of TabPrompt once again demonstrates the effectiveness of our proposed framework. Second, as both CTC and TTC share the same parameters of the pre-trained model, the superior performance of TabPrompt on both types of tasks further supports the notion that our tabular Graph CL enables the model to learn better vector representations.
Performance with different shots. From Table 1 and Table 2, it is evident that the performance of all models improves as the number of samples increases. Therefore, we study the trends in performance with varied shots. We test Method A, Method B, and our method with different shots on the TURL dataset in Fig. 6. The figure shows that in the low-shot cases, TabPrompt consistently outperforms the other methods. Although TabPrompt is surpassed at 50 shots (typically beyond 600 table cells), that amount of data is beyond the scope of our target scenario.

Ablation study
To analyze the individual contribution of each component in TabPrompt, we conduct the following ablation study: constructing the graph data with a vanilla method instead of ours; leaving the parameters of the soft prompt untuned; and replacing prompt tuning with fine-tuning to train the model in downstream tasks.
Results of the ablation experiments are shown in Table 3. It is evident that TabPrompt's performance significantly deteriorates when a vanilla method is used to construct the graph data. This can be attributed to the fact that the vanilla method dramatically diminishes the effectiveness of tabular Graph CL. It confirms that considering the topological information within tables helps machines comprehend the semantics of tables more effectively.
In our previous analysis, we hypothesized that the soft prompt can assign more weight to distinguishing features during the prompting process. We observe that when the parameters of the soft prompt are not tuned, the performance of TabPrompt decreases slightly. This observation confirms our hypothesis.
When we replace the prompt tuning method with fine-tuning, we observe a decrease in the model's performance across all three datasets.This outcome further reinforces that prompt tuning can effectively enhance the model's performance in scenarios with limited training tabular data.

Case Study
To facilitate an intuitive interpretation, we present a typical case from the CTC testing data that demonstrates the functionality of tabular Graph CL. Fig. 7 shows two classification results for a table. Fig. 7 (a) represents a correct CTC result, whereas the classification result in Fig. 7 (b) has some flaws. It highlights that misunderstandings may result from neglecting the topological structure information within the table, such as the facts that cells in one column tend to belong to the same category and that cells in the header rows are more likely to be a valueName than a value. For instance, the misclassified cell "July 1780 Typhoon" should have a stronger relationship with the cells above and below it, while its relationship with the cell to its right, "1780", is relatively weak despite both containing the text "1780".

CONCLUSION
In this paper, we proposed a new framework, TabPrompt. To tackle the scarcity of labeled tabular data and capture the untapped topological information, we resorted to prompt-based learning and Graph CL, respectively. The experimental results, in which TabPrompt outperforms all baselines, demonstrate its effectiveness in few-shot TU.
In future work, we aim to extend our method to handle domain-specific tables and more complex table structures, further enhancing its capabilities. In addition, we plan to introduce a multimodal technique that leverages both text and image data within tables to enhance the understanding of their semantics.

Limitations
First, the input of TabPrompt must be a machine-parsable data format, such as HTML files with <table> tags or CSV files. TabPrompt does not encompass the processes of locating or extracting tables from images or original web pages. Second, TabPrompt cannot handle tables with overly complex layouts, such as tables with images or too many empty cells. Third, TabPrompt performs poorly on small-sized tables due to the limited amount of text information available within these tables.

A Correspondence with Different Taxonomies and More Details

In this section, we comprehensively describe the correspondence between the various taxonomies.
The annotation taxonomy employed in TURL focuses on annotating the subject column within tables. For instance, in Fig. 1 (a), the subject column of the table is the second column, titled "language". The correspondence between the taxonomy of TURL and ours is illustrated in Fig. 8. The taxonomy of WebSheet is nearly identical to ours, except that it does not explicitly annotate unlabeled cells as value. Regarding the WCC dataset, we classify the tables categorized as Matrix into the category of Relational tables. This decision is made due to the absence of clear boundaries between these two types, as they share similar cell types.
To ensure data quality, we applied filters to remove certain tables.This includes tables with irregular layouts, where rows do not have the same number of cells as other rows, as well as tables with pictures, huge tables, and repeated tables.The filtering process was implemented through some rules.

B Manual Features
The main manual features used in our work are listed in Fig. 9.

C Hyper-parameters
The semantic features of nodes are first output by RoBERTa and then reduced to 512 dimensions through PCA. We employ a 3-layer GIN architecture as the backbone, whose hidden dimension is set as 1024. The activation function of GIN is ReLU. For pre-training, we set the learning rate as 0.01, the weight decay as 1e-5, and the dropout as 0.5. We set the readout function as mean pooling.

Fig. 9 (manual features). For CTC: the number of the row the current cell is located in; the number of the column the current cell is located in; whether the string of the current cell is unique in the current column; the string type of the current cell; whether the current cell comes from a merged cell. For TTC: the number of rows of the current table; the number of columns of the current table.

Figure 6: Impact of varied shots on CTC.

Figure 7: A real case for CTC.
Table Type Classification: examples of table types from WCC.