CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL

Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, Soumen Chakrabarti


Abstract
Existing Text-to-SQL generators require the entire schema to be encoded with the user text. This is expensive or impractical for large databases with tens of thousands of columns. Standard dense retrieval techniques are inadequate for schema subsetting of a large structured database, where the correct semantics of retrieval demands that we rank sets of schema elements rather than individual documents. In response, we propose a two-stage process for effective coverage during retrieval. First, we use an LLM to hallucinate a minimal DB schema that it deems adequate to answer the query. We use the hallucinated schema to retrieve a subset of the actual schema, by composing the results from multiple dense retrievals. Remarkably, hallucination — generally considered a nuisance — turns out to be actually useful as a bridging mechanism. Since no existing benchmarks exist for schema subsetting on large databases, we introduce two benchmarks: (1) A semi-synthetic dataset of 4502 schema elements, by taking a union of schema on the well-known SPIDER dataset, and (2) A real-life benchmark called SocialDB sourced from an actual large data warehouse comprising of 17844 schema elements. We show that our method leads to significantly higher recall than SOTA retrieval-based augmentation methods.
Anthology ID:
2023.emnlp-main.868
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
14054–14066
Language:
URL:
https://aclanthology.org/2023.emnlp-main.868
DOI:
10.18653/v1/2023.emnlp-main.868
Bibkey:
Cite (ACL):
Mayank Kothyari, Dhruva Dhingra, Sunita Sarawagi, and Soumen Chakrabarti. 2023. CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14054–14066, Singapore. Association for Computational Linguistics.
Cite (Informal):
CRUSH4SQL: Collective Retrieval Using Schema Hallucination For Text2SQL (Kothyari et al., EMNLP 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.emnlp-main.868.pdf
Video:
 https://aclanthology.org/2023.emnlp-main.868.mp4