Automatic Table Union Search with Tabular Representation Learning

Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation as it enables data scientists to navigate massive open data repositories. Existing methods identify unionability based on column representations (word surface forms or token embeddings) and on column relations represented by the similarity between column representations. However, the semantic similarity between column representations is often insufficient to reveal the latent relational features that describe the relation between a pair of columns, and it is not robust to table noise. To address these issues, in this paper, we propose a multi-stage self-supervised table union search framework called AUTOTUS, which represents each column relation as a vector (the column relational representation) and learns this representation in a multi-stage manner so that it better describes column relations for table unionability prediction. In particular, the large language model powered contextualized column relation encoder is updated iteratively by adaptive clustering and pseudo label classification so that a better column relational representation can be learned. Moreover, to improve the robustness of the model against table noises, we


Introduction
The growing availability of tabular data from academia and industry brings new opportunities for economic growth and social benefit, and it has attracted considerable attention in the natural language processing (NLP) community. A number of studies have been conducted on tasks such as Table QA (Chen et al., 2020; Zhu et al., 2021), Table fact verification (Aly et al., 2021; Guo et al., 2022), Table summarization (Xi et al., 2023), and Table understanding (Yang et al., 2022; Liu et al., 2022a, 2023). More recently, table union search (Nargesian et al., 2018; Bogatu et al., 2020; Khatiwada and Fan, 2023) has been proposed to facilitate tabular dataset discovery applications (Brickley et al., 2019; Galhotra and Khurana, 2020; Santos et al., 2022), which can benefit tabular data integration and analysis. Table union search aims to find all tables in a data lake that have columns from the same domain as the query table. With the help of table union search, the understanding of tabular data is largely improved, which can potentially benefit other table-focused NLP downstream tasks. Therefore, table union search is a non-trivial NLP research problem.
Recent literature shows that column representation methods are useful for table union search. The basic idea is to map each column to a latent vector space (e.g., column representations (Bogatu et al., 2020; Chepurko et al., 2020) or token embeddings (Zhang and Balog, 2017; Herzig et al., 2020; Fan et al., 2022; Yang et al., 2022)) and then compute the table unionability score based on the column relation obtained from the similarity scores of column pairs between the query table and the target tables. However, these methods only consider the column representations and their similarity as the column relation when computing the table unionability score, which is not enough to describe the relationship between two columns and may cause incorrect predictions. Figure 1 shows an example of table union search with a query table and a unionable table. For the second columns (in red) of the two tables, the column similarity score (a scalar) computed from the column embeddings is low because the token embeddings are dissimilar. In this case, the column relation is not intuitive, and the similarity between column embeddings alone is not enough to predict the table unionability accurately. However, if we combine the two column representations and model them as an embedding (a vector), the new representation (the column relational representation) is more informative and can better describe the unintuitive column relation, which makes the table unionability prediction more accurate. Therefore, a better approach is needed, one that makes the table unionability prediction via the column relational representation.
On the other hand, table noise is widespread in real-world data (Koutras et al., 2021; Liu et al., 2022b). Taking column "CLG" in the query table and column "POS" in the data lake table as examples, both columns have abbreviated column names, tokens are repeated in column values (e.g., "UCLA" becomes "UCUCLA"), and characters are replaced by proximal keyboard characters (e.g., "Shooting Guard" becomes "Shootung Guard"). Computing column representations or column relational representations based on token embeddings from noisy tables inevitably introduces wrong predictions. Therefore, the table union search model must be made robust to table noise during optimization.
To address the above issues, we propose a multi-stage self-supervised table union search framework called AUTOTUS, which represents each column relation as a vector (the column relational representation) and learns this representation in a multi-stage manner so that it better describes column relations for unionability prediction. In particular, the large language model powered contextualized column relation encoder is updated iteratively by adaptive clustering and pseudo label classification so that a better column relational representation can be learned. Moreover, to improve the robustness of the model against

Problem Definition
Table union search discovers all tables in a set of data lake tables that are unionable with the query table. Specifically, we focus on the automatic case where no labeled training data is available. We define a set of data lake tables as $\mathcal{T}$. There is a column relation $e_{ij} \in E$ for each pair of columns $t_i, t_j$ that describes their relationship. We employ a contextualized column relational encoder to obtain the column relational representation $z_{e_{ij}} \in Z$ for each column relation $e_{ij}$. The table unionability score $u_{(T_i, T_j)} \in U$ is obtained by the table scorer, which goes over all column relational representations between two tables $T_i, T_j$. Similar to previous studies (Nargesian et al., 2018; Bogatu et al., 2020; Khatiwada and Fan, 2023), given a query table $Q$, table union search retrieves the top-$y$ unionable tables $\mathcal{Q} \subseteq \mathcal{T}$, where $|\mathcal{Q}| = y$ and $\forall T \in \mathcal{Q}$ and
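The retrieval step in this definition can be sketched as a simple ranking, assuming a unionability score function is already available. The function name `top_y_unionable`, the table names, and the toy scores below are hypothetical and only serve as an illustration; they are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple


def top_y_unionable(query: str,
                    lake: List[str],
                    score: Callable[[str, str], float],
                    y: int) -> List[Tuple[str, float]]:
    """Rank data lake tables by unionability with the query and keep the top y."""
    scored = [(name, score(query, name)) for name in lake]
    # Sort by descending score; ties are broken by table name for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:y]


# Toy scores standing in for u_(Q, T); the values are made up for illustration.
toy_scores: Dict[str, float] = {"players.csv": 0.9, "teams.csv": 0.4, "parks.csv": 0.1}
result = top_y_unionable("query.csv", list(toy_scores),
                         lambda q, t: toy_scores[t], y=2)
print(result)
```

Any scorer with the same signature can be plugged in, so the ranking logic stays independent of how the score is computed.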

Proposed Method
In this section, we introduce the proposed AUTOTUS. The framework (as shown in Figure 2) consists of the following components: Multi-stage Self-supervised Column Relational Representation Learning, Table Noise Generation, and Table Unionability Scorer. We first illustrate the motivation for AUTOTUS. Existing column representation methods use cosine similarity to measure the unionability of column pairs. The column encoder in these methods is usually trained only with a binary classification task, i.e., whether two columns are unionable. In AUTOTUS, the table encoder can benefit from an additional task: the identification of hidden semantic types (domains, task knowledge, etc.) from the column pairs. In the absence of explicit type labels, we apply adaptive clustering to column pairs and learn those hidden types. The classification step is then used to update the table encoder, so that the embeddings are generated in a way that minimizes the hidden type prediction errors. In short, the clustering and classification modules learn the hidden unionable semantic types, resulting in a better table encoder for the table union search task. We leverage a large language model, BERT (Devlin et al., 2019), to effectively encode column relations along with their tabular context.

Multi-stage Self-supervised Column Relational Representation Learning
Pre-trained large language models such as BERT support an input length of up to 512 tokens (Devlin et al., 2019). However, columns in a real-world table may contain thousands of tokens, and randomly discarding tokens leads to the loss of important semantics. Therefore, in this work, we score the importance of each token or cell in a column and keep cells based on their importance. More specifically, we adopt the TF-IDF score to calculate the token importance through its inverse document frequency: $\log\left(|Y| \,/\, |\{t \mid \text{token} \in t,\ t \in Y\}|\right)$, where $t$ is a column and $|Y|$ is the number of all data lake columns. The cell score is then calculated by averaging the TF-IDF of all tokens within the cell. After the importance score of each column-level cell is obtained, row alignment ensures the correctness of the table semantics, e.g., Jimmy Butler's birthplace is aligned to the United States instead of the United Kingdom. Therefore, we calculate the average score of the cells in each row and sort the rows in descending order. Next, we select rows until the total token count reaches the input budget and obtain the table $T$. As shown in Figure 2, we serialize the table by column and follow the marking schema adopted in Soares et al. (2019) to augment $T$ with two reserved tokens that mark the start and the end of each column. We introduce ⟨s⟩ and ⟨/s⟩ and inject them into $T$ to form the input token sequence for the encoder.
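The row selection and serialization steps above can be sketched as follows. This is a minimal illustration under several assumptions: whitespace tokens stand in for the encoder's subword tokens, only the inverse document frequency term is used (as in the formula above), and the function names `idf_scores`, `select_rows`, and `serialize_by_column` are hypothetical.

```python
import math
from typing import Dict, List


def idf_scores(columns: List[List[str]]) -> Dict[str, float]:
    """IDF per token, treating every data lake column as one document."""
    doc_freq: Dict[str, int] = {}
    for column in columns:
        tokens = {tok for cell in column for tok in cell.lower().split()}
        for tok in tokens:
            doc_freq[tok] = doc_freq.get(tok, 0) + 1
    return {tok: math.log(len(columns) / df) for tok, df in doc_freq.items()}


def cell_score(cell: str, idf: Dict[str, float]) -> float:
    """Average token importance within one cell."""
    tokens = cell.lower().split()
    if not tokens:
        return 0.0
    return sum(idf.get(tok, 0.0) for tok in tokens) / len(tokens)


def select_rows(rows: List[List[str]], idf: Dict[str, float],
                budget: int) -> List[List[str]]:
    """Keep the highest-scoring rows whole until the token budget is reached."""
    def row_score(row: List[str]) -> float:
        return sum(cell_score(cell, idf) for cell in row) / len(row)

    kept: List[List[str]] = []
    used = 0
    for row in sorted(rows, key=row_score, reverse=True):
        cost = sum(len(cell.split()) for cell in row)
        if used + cost > budget:
            break
        kept.append(row)
        used += cost
    return kept


def serialize_by_column(rows: List[List[str]],
                        start: str = "<s>", end: str = "</s>") -> str:
    """Serialize column by column, wrapping each column in marker tokens."""
    return " ".join(f"{start} {' '.join(col)} {end}" for col in zip(*rows))


lake_columns = [["Jimmy Butler", "Joel Embiid"],
                ["United States", "Cameroon"],
                ["United States", "United Kingdom"]]
table_rows = [["Jimmy Butler", "United States"], ["Joel Embiid", "Cameroon"]]
idf = idf_scores(lake_columns)
selected = select_rows(table_rows, idf, budget=3)
print(serialize_by_column(selected))
```

Selecting whole rows rather than individual cells keeps each value aligned with the other values of its row, which is the alignment property the paragraph above asks for.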
We denote the Contextualized Column Relation Encoder as $f_\theta(T, \langle s \rangle, \langle /s \rangle)$. Instead of using the output of the [CLS] token from BERT itself, which summarizes the whole table-level semantics, we obtain the contextualized column relation semantics by concatenating the ⟨s⟩ representations of any two columns, in different tables or in the same table, and derive a fixed-length column relational representation $z_{e_{ij}}$.
Column Relation Encoder Updater via Adaptive Clustering. The Column Relation Encoder Updater via Adaptive Clustering encourages each low-confidence relation label distribution to move closer to the high-confidence cluster centroids, improves the confidence of the clusters, and optimizes the Contextualized Column Relation Encoder by relaxing constraints to generate better column relational representations. In addition, the cluster label generated for each column relation can be treated as a self-supervision signal, which serves as the pseudo label for the language model update. More specifically, after obtaining $N$ column relational representations, the updater softly assigns all $N$ column relational representations to $K$ clusters. In practice, we first adopt standard k-means clustering among the representations to obtain the initial centroids. Inspired by Maaten and Hinton (2008) and Hu et al. (2020), we leverage the Student's t-distribution as the kernel to measure the similarity $q_{nk}$ of the embedding point $z_n$ to each centroid $\xi_k$, where $q_{nk}$ can be viewed as the probability of assigning column relation embedding $z_n$ to cluster $k$ (the soft assignment) and $\alpha$ denotes the degree of freedom of the Student's t-distribution. We set $\alpha = 1$ for all experiments. We normalize each cluster by adopting the frequency-based auxiliary target distribution proposed by Xie et al. (2016) in Equation 4 and iteratively refine each cluster with its help. With the auxiliary target distribution $p_n$, we encourage each cluster to learn from its high-confidence cluster assignments while alleviating the bias caused by imbalanced clusters. We define the KL divergence loss between the soft assignments $q_n$ and the auxiliary distribution $p_n$ to train the Contextualized Column Relation Encoder.
Column Relation Encoder Updater via Pseudo Label Classification. We adopt the label associated with the largest probability as the pseudo label $\mu_n$ for the $n$-th column relation. The column relation classification module $\tau_\phi$, with parameters $\phi$, outputs $l_n$, the probability distribution for the $n$-th sample over the $K$ pseudo labels. To find the best-performing parameters $\theta$ for the Contextualized Column Relation Encoder and $\phi$ for the classifier, we optimize the classification loss, i.e., the cross-entropy loss between $l_n$ and $\mathrm{one\_hot}(\mu_n)$, where $\mathrm{one\_hot}(\mu_n)$ returns a one-hot pseudo label assignment vector.
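The clustering quantities described above can be sketched in plain Python. The sketch assumes the standard forms from Xie et al. (2016), namely $q_{nk} \propto (1 + \|z_n - \xi_k\|^2 / \alpha)^{-(\alpha+1)/2}$ and $p_{nk} \propto q_{nk}^2 / \sum_n q_{nk}$, each normalized over $k$, with loss $\mathrm{KL}(P \,\|\, Q)$. The function names are hypothetical, and taking the pseudo label as the argmax of the soft assignment is an assumption, since the text does not state which distribution the argmax is taken over.

```python
import math
from typing import List

Matrix = List[List[float]]


def soft_assign(z: Matrix, centroids: Matrix, alpha: float = 1.0) -> Matrix:
    """Student's t kernel between each embedding and each centroid, row-normalized."""
    q: Matrix = []
    for point in z:
        kernel = []
        for centroid in centroids:
            dist2 = sum((a - b) ** 2 for a, b in zip(point, centroid))
            kernel.append((1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(kernel)
        q.append([value / total for value in kernel])
    return q


def target_distribution(q: Matrix) -> Matrix:
    """Sharpen q by squaring and dividing by the soft cluster frequency."""
    freq = [sum(row[k] for row in q) for k in range(len(q[0]))]
    p: Matrix = []
    for row in q:
        weights = [row[k] ** 2 / freq[k] for k in range(len(row))]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_loss(p: Matrix, q: Matrix) -> float:
    """KL(P || Q) summed over all samples and clusters."""
    return sum(pk * math.log(pk / qk)
               for p_row, q_row in zip(p, q)
               for pk, qk in zip(p_row, q_row) if pk > 0.0)


def pseudo_labels(dist: Matrix) -> List[int]:
    """Index of the most probable cluster for every sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in dist]


# Toy 2-D embeddings and centroids; the values are made up for illustration.
z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.0]]
centroids = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(z, centroids)
p = target_distribution(q)
print(pseudo_labels(q), kl_loss(p, q))
```

In a full training loop the loss would be backpropagated through the encoder that produces `z`; this sketch only evaluates the quantities for fixed embeddings.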

Table Noise Generator
For each table $T$ in the data lake tables $\mathcal{T}$, we randomly apply the above noise to the training tabular data to improve the robustness of our model against table noise.

Table Unionability Scorer
The Table Scorer obtains the table unionability score based on the column relational representations of the column relations between the query table and the data lake tables, which provide more information for deciding whether the two tables come from the same domain and have a union relation. After obtaining a Contextualized Column Relation Encoder with enough discriminative power to produce the column relation embeddings $z_{e_{ij}} = (c_{\langle s_{t_i} \rangle}, c_{\langle s_{t_j} \rangle})$, we adopt the Column Relation Encoder Updater via Pseudo Label Classification to assign the $K$ relational labels to all $N$ column relations $z_n = (z_{e_{ij}})_n$ in the two tables, and we assign the majority label $\kappa$ as the predicted relational label for the query table and the data lake table.
We count the number of occurrences of the majority label $\kappa$ among all $N$ column relational representations between the two tables as $M$, so the table unionability score is $M/N$. This score no longer relies on the cosine similarity between two column features; it is obtained from the relational labels predicted for the column relations. The higher the score, the greater the probability that the two tables are unionable. Instead of relying on computations with insufficient column relational information, AUTOTUS directly models the complex column relational representations, which better addresses the table union search task.
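The $M/N$ score described above can be sketched in a few lines, assuming the relational labels of all $N$ column relations between two tables have already been predicted. The function name `unionability_score` and the example labels are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def unionability_score(relation_labels: Sequence[int]) -> Tuple[int, float]:
    """Return the majority label kappa and the score M / N."""
    if not relation_labels:
        raise ValueError("at least one column relation is required")
    majority, count = Counter(relation_labels).most_common(1)[0]
    return majority, count / len(relation_labels)


# Six column relations between a query table and a data lake table (made-up labels).
kappa, score = unionability_score([3, 3, 7, 3, 1, 3])
print(kappa, score)
```

Here four of the six relations share label 3, so $M = 4$, $N = 6$, and the score is $4/6$.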

Experimental Evaluation
We conduct extensive experiments on real-world public datasets as well as a synthetic test set augmented with table noise to show the effectiveness of AUTOTUS on the table union search task, and we give a detailed analysis.

Experimental Setups
Datasets. To evaluate our technique, three public datasets are used: SANTOS (Khatiwada and Fan, 2023), TUS Small, and TUS Large (Nargesian et al., 2018). These three datasets are subsets of Open Data released by authorities such as US Open Data, UK Open Data, and Canada Open Data, with manually labeled data. However, these data are usually officially filtered to remove much of the table noise and maintain high quality. To verify the robustness of the model, we synthesized a test set with table noise: we added both column value and column name noises to the tabular data and obtained the synthesized SANTOS N and TUS Large N. Note that although we add table noises, the human-annotated labels remain unchanged. To avoid information leakage, the manually labeled dataset is not used during training. We randomly generated the same number of unlabeled training tables as the labeled tables from Open Data (Koutras et al., 2021; Khatiwada and Fan, 2023). We give the detailed statistics of the datasets for the three benchmarks in Table 1 and the dataset construction details in Appendix A.
Experimental Settings and Metrics. We adopt BERT-Base, BERT-Large (Devlin et al., 2019), and RoBERTa-Large (Liu et al., 2019) as the encoder and set the maximum length to 512. For the Column Relation Encoder Updater via Adaptive Clustering, we adopt k-means and set $K = 50$ to get the initial centroids. We stop the clustering and classification loop when the current pseudo labels differ by less than 10% from those of the previous epoch. For the Column Relation Encoder Updater via Pseudo Label Classification, we adopt a fully connected layer as $\tau_\phi$ and set the dropout rate to 10%, the learning rate to 1e−5, and the warmup to 0.1. To allow the fully connected layer to warm up, we fix the parameters in $f_\theta$ for the first two epochs. We add one noise, either to the column data or to the column schemata, to each column to obtain the training dataset. We first randomly choose whether to add noise to the column data or to the column schemata; then, for every token-level replacement, addition, or deletion, we select 20% of all tokens in the cells of the column for modification.
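The stopping rule for the clustering and classification loop can be sketched as below. The sketch assumes that cluster identifiers are comparable across epochs (no relabeling between rounds), which the text does not state; the function names are hypothetical.

```python
from typing import Sequence


def label_change_ratio(previous: Sequence[int], current: Sequence[int]) -> float:
    """Fraction of samples whose pseudo label changed since the previous epoch."""
    if len(previous) != len(current) or not current:
        raise ValueError("label sequences must be non-empty and of equal length")
    changed = sum(a != b for a, b in zip(previous, current))
    return changed / len(current)


def should_stop(previous: Sequence[int], current: Sequence[int],
                threshold: float = 0.10) -> bool:
    """Stop once fewer than `threshold` of the pseudo labels differ."""
    return label_change_ratio(previous, current) < threshold


prev_labels = [0] * 20
print(should_stop(prev_labels, [0] * 19 + [1]))      # 1 of 20 labels changed
print(should_stop(prev_labels, [0] * 18 + [1, 1]))   # 2 of 20 labels changed
```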
For evaluation metrics, following previous work (Bogatu et al., 2020; Khatiwada and Fan, 2023), we adopt the Mean Average Precision at $Y$ ($MAP@Y$) and Recall at $Y$ ($R@Y$) to evaluate the effectiveness of the searched top-$Y$ table results. Note that $MAP@Y$ is the average value of Precision at $y$ ($P@y$), where $y = 1, 2, \ldots, Y$. Formally, given the query table $Q$ and a set of data lake tables $\mathcal{T}$, we define $T_Q$ as the set of unionable tables based on the ground truth and $T'_Q$ as the set of top-$Y$ unionable table results returned by the search method. $P@Y$ and $R@Y$ are calculated from the overlap between $T_Q$ and $T'_Q$. Note that a perfect $R@Y$ is not possible when the ground truth contains fewer than $Y$ searched table results. We define the Mean Average Precision as $MAP@Y = \frac{1}{Y} \sum_{y=1}^{Y} P@y$. For a fair comparison, we adopt $Y = 10$ on the SANTOS benchmarks and $Y = 60$ on the TUS benchmarks to be consistent with Nargesian et al. (2018) and Khatiwada and Fan (2023).
Baseline Models. We compare AUTOTUS with baseline models in two categories. The first category consists of table union search models that adopt various column features to calculate the table unionability score. We adopt D$^3$L (Bogatu et al., 2020), SATO (Zhang et al., 2020), Starmie (Fan et al., 2022), and SANTOS (Khatiwada and Fan, 2023) as baselines. The second category consists of table pretrained models that leverage unlabeled tables for self-supervised pretraining.
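The metrics above can be sketched as follows, assuming the usual set-overlap definitions: precision divides the number of relevant results in the top-$y$ list by the size of that list, and recall divides it by the size of the ground truth. The function names and the toy ranking are hypothetical.

```python
from typing import List, Set


def precision_at(ranked: List[str], relevant: Set[str], y: int) -> float:
    """Share of the top-y results that are unionable according to the ground truth."""
    top = ranked[:y]
    if not top:
        return 0.0
    return sum(table in relevant for table in top) / len(top)


def recall_at(ranked: List[str], relevant: Set[str], y: int) -> float:
    """Share of the ground-truth unionable tables found in the top-y results."""
    if not relevant:
        return 0.0
    return sum(table in relevant for table in ranked[:y]) / len(relevant)


def map_at(ranked: List[str], relevant: Set[str], big_y: int) -> float:
    """Average of P@y for y = 1..Y, as defined in the text."""
    return sum(precision_at(ranked, relevant, y) for y in range(1, big_y + 1)) / big_y


ranked_tables = ["a", "x", "b", "c"]          # search results, best first
ground_truth = {"a", "b", "c", "d"}           # unionable tables for the query
print(precision_at(ranked_tables, ground_truth, 4),
      recall_at(ranked_tables, ground_truth, 4),
      map_at(ranked_tables, ground_truth, 4))
```

For this toy ranking, $P@1 = 1$, $P@2 = 1/2$, $P@3 = 2/3$, and $P@4 = 3/4$, so $MAP@4$ is their mean.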

Results and Analysis
Overall Performance. In addition, we study the performance changes brought about by using the larger base encoders BERT-Large (Devlin et al., 2019) and RoBERTa-Large (Liu et al., 2019).
[Figure 3 panels: (a) P@K on SANTOS N; (c) P@K on TUS Small; (e) P@K on TUS Large N]
From Table 2, it can be concluded that leveraging a large-scale pre-trained language model with more parameters leads to better MAP and R: BERT-Large and RoBERTa-Large obtain an average boost of 1.1% and 1.4%, respectively. Another interesting finding is that, although all the models perform worse on the datasets with table noise, the average performance improvement of AUTOTUS over the SOTA is consistent and even increases to 6.6% on MAP and 3.8% on R. We attribute the improvement of AUTOTUS to the fact that the noisy, weak column relational representations from the large pre-trained language models are exploited and refined: we bootstrap the representations obtained from noisy tabular data via the self-supervised training schema. More specifically, we give the comparison of P@Y and R@Y as Y changes in Figure 3. We find that AUTOTUS consistently outperforms all baseline models on both metrics as Y changes.
Ablation Study. We conduct ablation studies to show the effectiveness of the different modules of AUTOTUS. A number of variants of AUTOTUS are considered, each removing one module: AUTOTUS w/o Table Noise Generation, AUTOTUS w/o Column Relation Encoder Updater via Pseudo Label Classification, and AUTOTUS w/o Column Relation Encoder Updater via Adaptive Clustering. The general conclusion from the ablation study results shown in Table 2 is that all three modules contribute positively to the performance. More specifically, without table noise in the training data, AUTOTUS w/o Table Noise Generation gives 2.3% less MAP and R averaged over all datasets; on the noisy SANTOS N and TUS Large N in particular, the drop reaches 3.5%. If we stop exploiting self-supervised signals for learning column relational features, AUTOTUS w/o Column Relation Encoder Updater via Pseudo Label Classification yields 12.8% less MAP and R averaged over all datasets. This large performance drop demonstrates the importance of learning and refining column-pair representations. The Column Relation Encoder Updater via Adaptive Clustering gives a 3.0% boost in MAP and R on average compared with the hard-assignment alternative (AUTOTUS w/o Column Relation Encoder Updater via Adaptive Clustering).
Visualizing Contextualized Column Relational Embeddings. To show intuitively how clustering-enhanced self-supervised training exploits self-supervised signals to obtain better contextualized column relational representations, we visualize the column relational representation space $\mathbb{R}^{2 \cdot h_R}$ after dimensionality reduction with t-SNE (Maaten and Hinton, 2008). We randomly choose 10 base tables from the TUS Large N dataset and sample all column relations that belong to the corresponding base table. We show the visualization results in Figure 4, with each column relation colored according to its ground truth.
From Figure 4, we can see that AUTOTUS w/o Column Relation Encoder Updater via Pseudo Label Classification can assign meaningful semantics to column relations from different base tables, but these unrefined features are not tailored to the table union search task. When the Column Relation Encoder Updater via Adaptive Clustering is not applied and only k-means is used, AUTOTUS w/o Column Relation Encoder Updater via Adaptive Clustering performs a hard assignment of features, and the clustering results appear messy since there are no cluster centroids with high confidence. AUTOTUS w/o Table Noise Generation demonstrates that the clustered features are not tolerant of noisy samples, resulting in unclear cluster boundaries. AUTOTUS shows denser and well-separated clusters, which verifies its strong column relational representation learning ability.
Parameter Analysis: when K is unknown. The Column Relation Encoder Updater via Adaptive Clustering provides the flexibility to explore column relational features without any prior information about the number of clusters. This property is attractive when the number of target clusters is unknown. We vary K from 10 to 200 and report the MAP@60 score in Figure 5. The best result is obtained when K = 25 for TUS Small and K = 50 for TUS Large, slightly larger than the number of base tables in the two datasets, indicating that AUTOTUS actually exploits the number of target clusters as useful prior knowledge. Thanks to the self-supervised training for exploiting column relational features and the flexibility brought by soft-assignment clustering, when we vary K from 10 to 200, AUTOTUS obtains a more stable MAP@60 score than AUTOTUS w/o Column Relation Encoder Updater via Adaptive Clustering.
Impact of Table Noise. We investigate the effect of table noise in column values and column names. As shown in Section 3.2, we randomly select a corresponding noise from the two types of noise. From Table 3, we observe that adding table noise to both column values and column names helps to improve the performance, whereas changing column names has a slightly larger effect (2.8% vs. 1.5%). This improvement may be related to the fact that a column name summarizes the entire column semantically, so adding noise to the column names can make the model more robust to table noise and thus yield larger performance gains. However, adding more noise did not consistently improve performance, suggesting that the model needs to find a tradeoff between noisy data perturbation and increased robustness to noise.
Related Work

Table Union Search
In dataset discovery, it is crucial to find related tables in data lakes. The table union search task has recently received considerable attention (Ling et al., 2013; Lehmberg and Bizer, 2017; Khatiwada and Fan, 2023). The initial attempt was made by Nargesian et al. (2018). D$^3$L (Bogatu et al., 2020) categorizes columns into groups by column features. Starmie (Fan et al., 2022) obtains the column features by utilizing a contextualized pretrained language model. SANTOS (Khatiwada and Fan, 2023) leverages a knowledge base to discover the unionable relations between two tables. More recently, a number of table pre-training models have been developed to obtain contextualized table representations (Yin et al., 2020; Gong et al., 2020; Dong et al., 2022). However, these methods are not tailored to the table union search task.

Self-supervised Learning in NLP
Self-supervised learning (SSL) is a method for building models where the output labels are already included in the input data, eliminating the need for additional labeled data (Liu et al., 2021; Hu et al., 2021a,b; Liu et al., 2022d,c). SSL has been widely used in NLP domains such as sentence generation (West et al., 2019; Yan et al., 2021), document processing (You et al., 2021; Ginzburg et al., 2021), natural language inference (Li et al., 2022, 2023), and text reasoning (Klein and Nabi, 2020; Fu et al., 2020; Chen et al., 2022). BERT (Devlin et al., 2019) is one of the most prominent SSL methods; it exploits self-supervision from a corpus with next sentence prediction and masked language modeling tasks. In our work, we adopt an SSL method to exploit and refine self-supervised signals from tabular data.

Conclusions
In

Limitations
We discuss our limitations from two perspectives: technical and application. Technical-wise: We currently only experiment with BERT-Base, BERT-Large, and RoBERTa-Large as the base encoders. Due to limited resources, we have not experimented with larger language models.
Application-wise: The experimental data comes from the Open Data repositories released by the governments of various countries. Although many domains are covered, some domain-specific data, such as biomedical data, have not been considered. Furthermore, our tabular data are all in English; open data research in other languages can be considered as a future research direction.

A Dataset Construction Details
For test data, the SANTOS Small benchmark (Khatiwada and Fan, 2023) consists of 550 data lake tables, which are generated from 296 open datasets from Canada, the United Kingdom, the United States, and Australia. SANTOS Small also has 50 query tables. There are two benchmarks accessible from Nargesian et al. (2018): TUS Small and TUS Large. TUS Small consists of 1,530 data lake tables that are generated from 10 base tables of Canada open data. TUS Small also has 150 query tables. TUS Large consists of 5,043 data lake tables that are generated from 32 base tables of Canada open data. TUS Large also has 100 query tables. The SANTOS and TUS benchmarks and their unionable table ground truth are publicly available. For training data, we randomly generated the same number of unlabeled training tables as the labeled tables from base tables of Open Data. Note that the code for generating tables is publicly available (Koutras et al., 2021; Khatiwada and Fan, 2023). Since table union search is an unsupervised task, no dev data is obtained.

B The Introduction of Baseline Models
We compare AUTOTUS with two categories of baseline models. The first category of models leverages unlabeled tables for self-supervised pretraining and achieves promising results in table understanding tasks; we use them as baseline encoders to calculate the column unionability score: (1) TaBERT (Yin et al., 2020) is a pretrained language model that simultaneously learns representations for (semi-)structured tables and natural language phrases. 26 million tables and their English contexts make up the vast corpus on which TaBERT was trained.
(2) TABBIE (Iida et al., 2021) develops a straightforward pretraining target (corrupt cell identification) that only learns from tabular data and achieves the state-of-the-art on the table-based tasks.TABBIE offers embeddings of all table substructures (cells, rows, and columns), unlike competing techniques, and it also takes far less computing power to train.
(3) TUTA (Wang et al., 2021) is a unified pretraining architecture for comprehending typically arranged tables.TUTA improves transformers with three structure-aware techniques after realizing that understanding a table necessitates spatial, hierarchical, and semantic information.
(4) FORTAP (Cheng et al., 2022) explores leveraging spreadsheet formulas for table pretraining. FORTAP adopts two self-supervised pretraining objectives derived from formulas: numerical reference prediction and numerical calculation prediction.
(5) TableFormer (Yang et al., 2022) is a structurally conscious table-text encoding architecture in which learnable attention biases are used to fully include tabular structural biases.
Of course, there are more baseline models in this category; we select only representative and SOTA models. The second category of models adopts various column representations to calculate the column unionability score: (6) D$^3$L (Bogatu et al., 2020) creates hash-based indexes using the features of the items in a dataset and maps those features to a uniform distance space.
(7) SATO (Zhang et al., 2020) is a hybrid machine learning model that uses both the context of the table and the values of the columns to automatically identify the semantic categories of columns in tables.
(8) Starmie (Fan et al., 2022) obtains the semantic information included inside tables by utilizing a contrastive multi-column pre-training technique.
(9) SANTOS (Khatiwada and Fan, 2023) suggests a definition of unionability that takes connections between columns and their semantics into principled consideration.

Figure 1: The table union search example on Open Data.

Figure 3: P@Y and R@Y results on different datasets.

Figure 4: Visualizing column relational representations after t-SNE dimension reduction on TUS Large N.
table noises, we propose a table noise generator to add table noise to the training table data. Extensive experiments are conducted on three real-world datasets; they show significant performance gains compared to existing state-of-the-art methods and demonstrate the robustness of AUTOTUS against table noise. Our contributions are fourfold: 1) We propose a novel framework, AUTOTUS, that leverages column relational representations for table union search instead of pairwise column similarity. 2) We propose a multi-stage self-supervised method that updates the large language model powered contextualized column relation encoder in a step-by-step, iterative way to learn better column relational representations. 3) We propose a table noise generator and add table noise to the training tabular data to improve the robustness of the model against table noise. 4) We conduct extensive experiments on real-world datasets as well as a synthetic test set augmented with table noise, which demonstrate significant performance gains of AUTOTUS over strong baselines and show the robustness of AUTOTUS against table noise.

Table A
The table noise generator module aims to improve the robustness of the model against table noise. We first summarize the noise types that exist in real-world tabular data and introduce these table noises to the column relations during the self-supervised training. Specifically, we define two types of table noise: column value noise and column name noise. Here are some representative column value noises: 1) Replace / insert characters with proximal characters, e.g., computer → cimputer / coimputer. 2) Delete / repeat characters, e.g., computer → compuer / compputer. 3) Change the numeral display format like scientific notation, e.g., 12000 → 1.2e4. Here are some representative column name noises: 1) Prefix column names with table name, e.g., travel_destination
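The noise types listed above can be sketched as small string transformations. This is an illustration under stated assumptions, not the authors' generator: the keyboard adjacency map `NEIGHBORS` is a partial, hand-written stand-in, the function names are hypothetical, and the 20% token ratio follows the experimental settings reported in the paper.

```python
import math
import random

# Partial QWERTY adjacency map; an illustrative assumption, not the authors' table.
NEIGHBORS = {"o": "ip", "i": "uo", "u": "yi", "e": "wr",
             "t": "ry", "r": "et", "a": "qs", "s": "ad"}


def proximal_noise(token: str, rng: random.Random, insert: bool = False) -> str:
    """Replace one character with a keyboard neighbor, or insert the neighbor after it."""
    positions = [i for i, ch in enumerate(token) if ch.lower() in NEIGHBORS]
    if not positions:
        return token
    i = rng.choice(positions)
    neighbor = rng.choice(NEIGHBORS[token[i].lower()])
    if insert:
        return token[:i + 1] + neighbor + token[i + 1:]
    return token[:i] + neighbor + token[i + 1:]


def drop_or_repeat(token: str, rng: random.Random, repeat: bool = False) -> str:
    """Delete one character, or repeat it when `repeat` is set."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    if repeat:
        return token[:i + 1] + token[i] + token[i + 1:]
    return token[:i] + token[i + 1:]


def to_scientific(token: str) -> str:
    """Rewrite a numeric token in compact scientific notation, e.g. 12000 -> 1.2e4."""
    try:
        value = float(token)
    except ValueError:
        return token
    if not math.isfinite(value):
        return token
    mantissa, exponent = f"{value:e}".split("e")
    mantissa = mantissa.rstrip("0").rstrip(".")
    return f"{mantissa}e{int(exponent)}"


def prefix_column_name(table_name: str, column_name: str) -> str:
    """Column name noise: prefix the column name with the table name."""
    return f"{table_name}_{column_name}"


def add_noise_to_cell(cell: str, rng: random.Random, ratio: float = 0.2) -> str:
    """Apply one random character-level noise to roughly `ratio` of the cell's tokens."""
    tokens = cell.split()
    if not tokens:
        return cell
    operations = [
        lambda tok: proximal_noise(tok, rng),
        lambda tok: proximal_noise(tok, rng, insert=True),
        lambda tok: drop_or_repeat(tok, rng),
        lambda tok: drop_or_repeat(tok, rng, repeat=True),
    ]
    count = max(1, round(ratio * len(tokens)))
    for i in rng.sample(range(len(tokens)), count):
        tokens[i] = rng.choice(operations)(tokens[i])
    return " ".join(tokens)


rng = random.Random(0)  # fixed seed so repeated runs give the same noisy output
print(add_noise_to_cell("Shooting Guard", rng))
print(to_scientific("12000"), prefix_column_name("travel", "destination"))
```

Passing an explicit `random.Random` instance keeps the perturbations reproducible, which helps when the same noisy test set must be regenerated.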

Table 1: The statistics of datasets for three benchmarks.

Table 2: MAP@Y and R@Y comparisons (%). Results of AUTOTUS are averaged over five runs. † means we replace the base encoder from RoBERTa-Base to BERT-Base, and all models are based on BERT-Base for a fair comparison. Note that a perfect R@Y is not possible when the ground truth contains fewer than Y searched tables.
Table pretrained models leverage unlabeled tables for training and achieve promising results in various table-related NLP tasks. We adopt TaBERT (Yin et al., 2020), TABBIE (Iida et al., 2021), TUTA (Wang et al., 2021), FORTAP (Cheng et al., 2022), and TableFormer (Yang et al., 2022) as baseline encoders to calculate the column unionability score and adopt our Table Scorer module to obtain the table unionability score. Although there are more baseline models in this category, we select only representative and SOTA ones. The details of the baselines are introduced in Appendix B.

Table 2
AUTOTUS exploits the column relational representations and is able to beat all the table union search baselines.
AUTOTUS w/o Table Noise Generation removes the added table noise in the training data; AUTOTUS w/o Column Relation Encoder Updater via Pseudo Label Classification is AUTOTUS without the Column Relation Encoder Updater via Pseudo Label Classification and only uses the Contextualized Column Relation Encoder for the Column Relation Encoder Updater via Adaptive Clustering; AUTOTUS w/o Column Relation Encoder Updater via Adaptive Clustering replaces the suggested soft-assignment clustering technique with k-means as a hard-assignment alternative.