Automatic Table Union Search with Tabular Representation Learning

Xuming Hu; Shen Wang; Xiao Qin; Chuan Lei; Zhengyuan Shen; Christos Faloutsos; Asterios Katsifodimos; George Karypis; Lijie Wen; Philip S. Yu

doi:10.18653/v1/2023.findings-acl.233

Abstract

Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation, as it enables data scientists to navigate massive open data repositories. Existing methods identify unionability based on column representations (word surface forms or token embeddings) and on column relations represented by the similarity of those column representations. However, the semantic similarity obtained between column representations is often insufficient to reveal the latent relational features that describe the relation between a pair of columns, and it is not robust to table noise. To address these issues, in this paper we propose a multi-stage self-supervised table union search framework called AutoTUS, which represents a column relation as a vector, the column relational representation, and learns this representation in a multi-stage manner so that it better describes column relations for unionability prediction. In particular, a contextualized column relation encoder powered by a large language model is updated iteratively by adaptive clustering and pseudo-label classification, so that a better column relational representation can be learned. Moreover, to improve the robustness of the model against table noise, we propose a table noise generator that adds table noise to the training table data. Experiments on real-world datasets as well as a synthetic test set augmented with table noise show that AutoTUS achieves a 5.2% performance gain over the SOTA baseline.
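To make the task concrete, the following is a minimal sketch of the kind of surface-form baseline the abstract contrasts against, not AutoTUS itself. It assumes tables are plain dictionaries from column name to cell values, scores a column pair by Jaccard overlap of cell values, and matches columns greedily one-to-one. The names `jaccard`, `unionability`, `search`, and the toy tables are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Assumption: a table is a mapping from column name to its list of cell values.
Table = Dict[str, List[str]]


def jaccard(a: List[str], b: List[str]) -> float:
    """Surface-form similarity between two columns: overlap of distinct values."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def unionability(query: Table, candidate: Table) -> float:
    """Greedy one-to-one column matching, averaged over the query columns."""
    pairs = sorted(
        (
            (jaccard(q_vals, c_vals), q_name, c_name)
            for q_name, q_vals in query.items()
            for c_name, c_vals in candidate.items()
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, q_name, c_name in pairs:
        if q_name in used_q or c_name in used_c:
            continue  # each column may take part in at most one match
        used_q.add(q_name)
        used_c.add(c_name)
        total += score
    return total / len(query) if query else 0.0


def search(query: Table, lake: Dict[str, Table], k: int = 2) -> List[str]:
    """Return the names of the k tables in the lake with the highest score."""
    ranked = sorted(
        lake.items(), key=lambda item: unionability(query, item[1]), reverse=True
    )
    return [name for name, _ in ranked[:k]]


# Toy data lake (hypothetical values).
query = {
    "city": ["Toronto", "Paris", "Delft"],
    "country": ["Canada", "France", "Netherlands"],
}
lake = {
    "cities_b": {
        "town": ["Paris", "Delft", "Seattle"],
        "nation": ["France", "Netherlands", "USA"],
    },
    "products": {"sku": ["A1", "B2"], "price": ["10", "20"]},
}

print(search(query, lake, k=1))
```

A score built only on value overlap drops to zero when two unionable columns share no literal values, and it shifts when cells are perturbed, which is the weakness the abstract attributes to similarity over column representations.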

Anthology ID: 2023.findings-acl.233
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 3786–3800
URL: https://aclanthology.org/2023.findings-acl.233/
DOI: 10.18653/v1/2023.findings-acl.233
Cite (ACL): Xuming Hu, Shen Wang, Xiao Qin, Chuan Lei, Zhengyuan Shen, Christos Faloutsos, Asterios Katsifodimos, George Karypis, Lijie Wen, and Philip S. Yu. 2023. Automatic Table Union Search with Tabular Representation Learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 3786–3800, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Automatic Table Union Search with Tabular Representation Learning (Hu et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-acl.233.pdf
