OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering

Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, Weizhu Chen


Abstract
The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Motivated by the fact that table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab, with the best multitasking approach achieving an absolute gain of 16.2% and 2.7% in 128-shot and full settings respectively, also establishing a new state-of-the-art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining.
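To make the two kinds of pretraining data described in the abstract concrete, below is a minimal illustrative sketch (not the authors' code from jzbjyb/omnitab). The table linearization scheme, the <mask> token, and all helper names here are assumptions for illustration: natural examples pair a retrieved sentence with a table and mask a table-grounded mention, while synthetic examples pair an NL question converted from sampled SQL with its execution result; both feed the same seq2seq pretraining objective.

```python
# Illustrative sketch of the two pretraining data types (assumed details, not the official code).
from dataclasses import dataclass
from typing import List


@dataclass
class PretrainExample:
    source: str  # encoder input: sentence/question + flattened table
    target: str  # decoder target: masked mention or QA answer


def flatten_table(header: List[str], rows: List[List[str]]) -> str:
    """Linearize a table into one string (one common way to feed tables to a text model)."""
    head = " | ".join(header)
    body = " ".join("row: " + " | ".join(r) for r in rows)
    return f"header: {head} {body}"


def natural_example(sentence: str, mention: str, header, rows) -> PretrainExample:
    """Natural data: a sentence retrieved as relevant to the table; a table-grounded
    mention is masked and the model is trained to recover it (mask-based pretraining)."""
    masked = sentence.replace(mention, "<mask>", 1)
    return PretrainExample(source=masked + " " + flatten_table(header, rows), target=mention)


def synthetic_example(nl_question: str, answer: str, header, rows) -> PretrainExample:
    """Synthetic data: an NL question converted from SQL sampled over the table,
    paired with the SQL execution result and trained with a QA loss."""
    return PretrainExample(source=nl_question + " " + flatten_table(header, rows), target=answer)


if __name__ == "__main__":
    header = ["Player", "Team", "Goals"]
    rows = [["Alice", "Reds", "12"], ["Bob", "Blues", "9"]]
    # Natural pairing: mask the number "12", which is grounded in the table.
    print(natural_example("Alice scored 12 goals this season.", "12", header, rows))
    # Synthetic pairing: SQL such as `SELECT Player ORDER BY Goals DESC LIMIT 1`
    # verbalized as a question, with its execution result as the answer.
    print(synthetic_example("Who scored the most goals?", "Alice", header, rows))
```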
Anthology ID:
2022.naacl-main.68
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
932–942
URL:
https://aclanthology.org/2022.naacl-main.68
DOI:
10.18653/v1/2022.naacl-main.68
Bibkey:
Cite (ACL):
Zhengbao Jiang, Yi Mao, Pengcheng He, Graham Neubig, and Weizhu Chen. 2022. OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 932–942, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering (Jiang et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.68.pdf
Video:
https://aclanthology.org/2022.naacl-main.68.mp4
Code:
jzbjyb/omnitab
Data:
Spider-Realistic, WikiSQL, WikiTableQuestions