Jiarong Jiang
2023
Importance of Synthesizing High-quality Data for Text-to-SQL Parsing
Yiqun Hu
|
Yiyun Zhao
|
Jiarong Jiang
|
Wuwei Lan
|
Henghui Zhu
|
Anuj Chauhan
|
Alexander Hanbo Li
|
Lin Pan
|
Jun Wang
|
Chung-Wei Hang
|
Sheng Zhang
|
Jiang Guo
|
Mingwen Dong
|
Joseph Lilien
|
Patrick Ng
|
Zhiguo Wang
|
Vittorio Castelli
|
Bing Xiang
Findings of the Association for Computational Linguistics: ACL 2023
There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examined the existing synthesized datasets and discovered that state-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data. We observed three shortcomings: illogical synthetic SQL queries from independent column sampling, arbitrary table joins, and language gaps between the synthesized SQL and natural language question (NLQ) pair. To address these issues, we propose a novel synthesis framework that imposes strong typing constraints, incorporates key relationships from schema, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated NLQ. When existing powerful text-to-SQL parsers are pretrained on our high-quality synthesized data, these models have significant accuracy boosts and achieve new state-of-the-art performance on Spider. We also demonstrate the effectiveness of our techniques with ablation studies
2022
Improving Text-to-SQL Semantic Parsing with Fine-grained Query Understanding
Jun Wang
|
Patrick Ng
|
Alexander Hanbo Li
|
Jiarong Jiang
|
Zhiguo Wang
|
Bing Xiang
|
Ramesh Nallapati
|
Sudipta Sengupta
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Most recent research on Text-to-SQL semantic parsing relies on either parser itself or simple heuristic based approach to understand natural language query (NLQ). When synthesizing a SQL query, there is no explicit semantic information of NLQ available to the parser which leads to undesirable generalization performance. In addition, without lexical-level fine-grained query understanding, linking between query and database can only rely on fuzzy string match which leads to suboptimal performance in real applications. In view of this, in this paper we present a general-purpose, modular neural semantic parsing framework that is based on token-level fine-grained query understanding. Our framework consists of three modules: named entity recognizer (NER), neural entity linker (NEL) and neural semantic parser (NSP). By jointly modeling query and database, NER model analyzes user intents and identifies entities in the query. NEL model links typed entities to schema and cell values in database. Parser model leverages available semantic information and linking results and synthesizes tree-structured SQL queries based on dynamically generated grammar. Experiments on SQUALL, a newly released semantic parsing dataset, show that we can achieve 56.8% execution accuracy on WikiTableQuestions (WTQ) test set, which outperforms the state-of-the-art model by 2.7%.
Search
Fix data
Co-authors
- Alexander Hanbo Li 2
- Patrick Ng 2
- Jun Wang (王军) 2
- Zhiguo Wang 2
- Bing Xiang 2
- show all...