TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data

Pengcheng Yin, Graham Neubig, Wen-tau Yih, Sebastian Riedel


Abstract
Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, and hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.
Anthology ID:
2020.acl-main.745
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
8413–8426
URL:
https://aclanthology.org/2020.acl-main.745
DOI:
10.18653/v1/2020.acl-main.745
Cite (ACL):
Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8413–8426, Online. Association for Computational Linguistics.
Cite (Informal):
TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data (Yin et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.745.pdf
Video:
http://slideslive.com/38929345
Code:
facebookresearch/tabert
Data:
Spider, WikiTableQuestions