KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers

The goal of database question answering is to enable natural language querying of real-life relational databases in diverse application domains. Recently, large-scale datasets such as Spider and WikiSQL facilitated novel modeling techniques for text-to-SQL parsing, improving zero-shot generalization to unseen databases. In this work, we examine the challenges that still prevent these techniques from practical deployment. First, we present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions. Second, we re-examine the choice of evaluation tasks for text-to-SQL parsers as applied in real-life settings. Finally, we augment our in-domain evaluation task with database documentation, a naturally occurring source of implicit domain knowledge. We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers, but that a more realistic evaluation setting and creative use of associated database documentation boost their accuracy by over 13.2%, doubling their performance.


Introduction
Text-to-SQL parsing is a form of database question answering (DBQA) that answers a user's natural-language (NL) question by converting it into a SQL query over a given relational database. It can facilitate NL-based interfaces for arbitrary end-user applications, thereby removing the need for domain-specific UX or for learning query languages. As such, DBQA has attracted significant attention in academia and industry, with development of supervised datasets (Yu et al., 2018), large-scale models (Wang et al., 2020b; Zeng et al., 2020), and novel modeling techniques (Deng et al., 2020).
The key challenge of text-to-SQL parsing is zero-shot generalization to unseen domains, i.e. to new database schemas and differently distributed NL questions. Large-scale annotated datasets like Spider (Yu et al., 2018) and WikiSQL (Zhong et al., 2017) evaluate cross-domain generalization of text-to-SQL parsers by restricting overlap between train and test domains. Such challenging benchmarks facilitate rapid progress in DBQA. State-of-the-art (SOTA) accuracy on Spider rose from 12.4% to 70.5% in just two years since its release, demonstrating the value of well-chosen evaluation settings.
Despite impressive progress in DBQA, deployment of SOTA parsers is still challenging. They often lack the robustness necessary to deploy on real-life application domains. While many challenges underlie the gap between SOTA DBQA and its real-life deployment, we identify three specific discrepancies.
First, the Spider and WikiSQL datasets normalize and preprocess database schemas, or rely on academic example databases that originate with human-readable schemas (Suhr et al., 2020). In contrast, industrial databases feature abbreviated and obscure naming of tables, columns, and data values, often accrued from legacy development or migrations. Figure 1 shows a characteristic example. After deployment, text-to-SQL parsers struggle with schema linking to domain-specific entities because they do not match the distribution seen in their pre-training (e.g. BERT) or supervised training (e.g. Spider).
Second, the NL questions of Spider and WikiSQL have a high column mention percentage (Deng et al., 2020), which makes their language unrealistic. This can be an artifact of rule-generated NL templates (as in WikiSQL) or annotation UIs that prime the annotators toward the schema (as in Spider). Either way, real-world deployment of a text-to-SQL parser optimized on Spider faces a distribution shift in NL, which reduces its realistic performance.
Finally, the standard evaluation setting of cross-domain text-to-SQL parsing assumes no in-domain examples, which is overly restrictive compared to practical deployment. To address these discrepancies, we present KaggleDBQA, a new cross-domain evaluation dataset:

• To test database generalization, it includes real-world databases from Kaggle, a platform for data science competitions and dataset distribution. They feature abbreviated and obscure column names, domain-specific categorical values, and minimal preprocessing (Section 3.1).
• To test question generalization, we collected unrestricted NL questions over the databases in KaggleDBQA. Importantly, the annotators were not presented with original column names, and given no task priming (Section 3.2). Out of 400 collected questions, one-third were out of scope for SOTA text-to-SQL parsers. The remaining 272 questions, while expressible, can only be solved to 13.56% accuracy (Section 4).
• Finally, we augment KaggleDBQA with database documentation, common metadata for real-world databases and a rich source of implicit domain knowledge. Database documentation includes column and table descriptions, categorical value descriptions (known as data dictionaries), SQL examples, and more (Section 3.3). We present a technique to augment SOTA parsers with column and value descriptions, which significantly improves their out-of-domain accuracy (Section 4). Figure 1 shows a representative example from the dataset. Aligning "federal revenue" and t_fed_rev is hard without domain knowledge.
In addition to more realistic data and questions, we argue that evaluation of real-world text-to-SQL performance should assume few-shot access to ∼10 in-domain question-SQL examples rather than measuring zero-shot performance. In practical terms, few-shot evaluation assumes up to 1-2 hours of effort by a target database administrator or application developer, and translates to significant performance benefits. In a few-shot evaluation setting, augmenting a SOTA text-to-SQL parser (RAT-SQL by Wang et al. (2020b)) with database documentation almost doubled its performance from 13.56% to 26.77%. See Section 4.

Related Work
Text-to-SQL Semantic Parsing Semantic parsing has been studied extensively for decades (Liang, 2016). Key in-domain datasets such as GeoQuery (Zelle and Mooney, 1996) and ATIS (Dahl et al., 1994) acted as an initial catalyst for the field by providing an evaluation measure and a training set for learned models. Applying a system to a domain with a different distribution of questions or parses required out-of-domain data or domain-transfer techniques. Recently, the cross-domain datasets WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018) proposed a zero-shot evaluation methodology that requires out-of-domain generalization to unseen database domains. This inspired rapid development of domain-conditioned parsers that work "out of the box", such as RAT-SQL (Wang et al., 2020b) and IRNet (Guo et al., 2019). We use the same exact-match accuracy metric as these works. Recent work (Zhong et al., 2020) has instead proposed evaluating SQL predictions via semantic accuracy, computing denotation accuracy on automatically generated databases.
Few-shot learning In this paper, we propose a few-shot evaluation to inspire future research on practical text-to-SQL parsers. Like zero-shot, the few-shot setting has access to many out-of-domain examples, but it also has access to a small number of in-domain examples. Few-shot learning has been applied to text classification (Mukherjee and Awadallah, 2020) as well as to semantic parsing. Common techniques include meta-learning (Huang et al., 2018; Wang et al., 2020a; Li et al., 2021; Sun et al., 2020) and adversarial learning.

Generalization and Practical Usability
Recent work has begun to question whether existing datasets are constructed in a way that will lead to models that generalize well to new domains. Suhr et al. (2020) identified a number of challenges with text-to-SQL datasets, one of which is an artificially high overlap between words in a question and words in the tables. This issue appears in Spider and is a byproduct of the fact that question authors view the database schema as they write their questions. The Spider-Realistic dataset (Deng et al., 2020) aims to reduce this by explicitly rewriting the questions to avoid overlapping terms. Other work has studied the gap between academic datasets and their practical usability (de Vries et al., 2020; Radhakrishnan et al., 2020; Zhang et al., 2020), including highlighting the need for data to be real. Our goal was to create an evaluation dataset and metric that minimize this gap; our dataset is constructed from real data found on Kaggle that has been used for competitions or other analyses.
Another direction of generalization being explored is compositionality. Keysers et al. (2020) used rules to generate a large-scale semantic parsing dataset that specifically tests models for compositionality. Rastogi et al. (2020) provide NL descriptions for slots and intents to help dialogue state tracking. Logeswaran et al. (2019) use descriptions to facilitate zero-shot learning for entity linking. Weller et al. (2020) use descriptions to develop a system that can perform zero-shot learning on new tasks. We follow suit by including documentation for each included real-world database. Notably, this documentation was written for human consumption of the database rather than prepared for KaggleDBQA, and is thus a natural source of domain knowledge. It provides benefits similar to codebase documentation and comments, which improve source code encoding for AI-assisted software engineering tasks (Panthaplackel et al., 2020).

KaggleDBQA: A Real World Dataset
The goal of the KaggleDBQA evaluation dataset is to more closely reflect the data and questions a text-to-SQL parser might encounter in a real-world setting. As such, it expands upon contemporary cross-domain text-to-SQL datasets in three key aspects: (i) its databases are pulled from real-world data sources and not normalized; (ii) its questions are authored in environments that mimic natural question answering; (iii) its evaluation assumes the type of system augmentation and tuning that could be expected from the domain experts who deploy a text-to-SQL parser. We describe each of these components in turn in this section.

Database Collection
We chose to obtain databases from Kaggle, a popular platform for hosting data science competitions and sharing datasets and code. Its hosted datasets are by definition "real", as they are used by members of the site for research. Competition hosts upload their data unnormalized, and the data content and formatting match their domain-specific usage (see Figure 1 for an example). To construct KaggleDBQA, we randomly selected 8 Kaggle datasets that satisfied the following criteria: (a) they contained a SQLite database; (b) they were licensed under a republishing-permissive license; (c) they had associated documentation that described the meaning of the tables and columns.

Questions
For each database, we asked five annotators to write ten domain-specific questions that they think someone might be interested in and that can be answered using the database. We use five annotators per database to help guarantee diversity of questions. Each annotator covered two databases, for a total of 20 annotators and 400 questions. The annotators are not required to possess SQL knowledge, so their questions are more reflective of natural user interests. Importantly, to discourage annotators from using the same terms as the database schema in their questions, we replace the original column names with the column descriptions. When annotating the questions, the annotators are shown a paragraph description of the database, table names, column descriptions, and ten sampled rows for each table. We do not provide any constraints or templates, other than asking them to avoid using exact phrases from the column headings in their questions. Appendix A.2.3 shows the full guidelines.
Separately, each question is annotated with its SQL equivalent by independent SQL experts. They are given full access to all of the data content and database schema. One-third of the questions were yes/no, percentage, temporal, or unexpressible in SQL and were not considered in our evaluation of SOTA models (see Appendix A.2.2 for details), leaving 272 questions in total.

Database Documentation
Each database has associated plain-text documentation that can assist text-to-SQL parsing. It is commonly found as internal documentation for database administrators or external documentation accompanying a dataset release. The contents vary but often contain an overview of the database domain, descriptions of tables and columns, sample queries, original sources, and more.
While all of these types of information could be leveraged to assist with domain transfer, in this work we focus on the column descriptions. They help address the schema linking problem of textto-SQL parsing, i.e. aligning entity references in the question with database columns (Wang et al., 2020b). For example, "federal revenue" in Figure 1 must be aligned to the column t_fed_rev even though its abbreviated name makes alignment non-obvious.
We manually extract the column descriptions from the database documentation and provide the mapping from column to description as part of KaggleDBQA. The descriptions are free text and sometimes contain additional information, such as defining the values in a categorical column. Such information could help with the value-linking problem (mapping a value in the question to the column that likely contains it). We leave the entire description as a single field and leave it to future work to explore these uses further. In addition to column descriptions, we also include the original unstructured documentation, which can be used for future research on automatically extracting descriptions or leveraging other domain knowledge.
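As a concrete illustration, the per-column description mapping can be consumed as a simple lookup with a fallback to the raw column name. The table name, field layout, and `describe` helper below are hypothetical sketches, not the released KaggleDBQA file format:

```python
# Hypothetical shape of a column-description mapping; the actual
# KaggleDBQA release format and table names may differ.
column_descriptions = {
    "finrev_fed_17": {  # hypothetical table name
        "t_fed_rev": "total federal revenue through the state",
        "t_fed_rev_pp": "per-pupil total federal revenue",
    }
}

def describe(table: str, column: str) -> str:
    """Return the free-text description of a column, falling back to
    the raw column name when no description is available."""
    return column_descriptions.get(table, {}).get(column, column)

print(describe("finrev_fed_17", "t_fed_rev"))
```

A parser-side consumer would typically use this lookup when building its schema encoding, as discussed in Section "Leveraging Database Documentation".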

Few-shot Evaluation Setting
The current cross-domain datasets Spider (Yu et al., 2018) and WikiSQL (Zhong et al., 2017) evaluate models in a zero-shot setting, meaning the model is trained on one set of domains and evaluated on a completely disjoint set. This evaluation encourages the development of systems that work well "out of the box" and has spurred great development in cross-domain text-to-SQL systems that are able to generalize to new domains. However, we believe the zero-shot setting is overly restrictive compared to how text-to-SQL systems are likely to be actually used in practice. We postulate that it is more realistic to assume a setting where an application author spends 1-2 hours authoring examples and adapting existing database documentation. This time investment is a small fraction of the time required to prepare the application itself, so we believe application authors would devote the time if it resulted in increased text-to-SQL accuracy. In informal experiments, we have found SQL annotators can author 10-20 examples in an hour. Thus, the KaggleDBQA evaluation setting is few-shot: 30% of the questions for each domain (6-15 depending on the domain) are designated as in-domain and may be used as part of training for that domain, along with documentation. The remaining 70% are used for evaluation.
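The 30%/70% protocol above can be sketched as a per-domain partition. This is an illustrative re-implementation under an assumed input shape (a dict mapping each domain to its questions), not the official split script:

```python
import random

def few_shot_split(questions_by_domain, in_domain_frac=0.3, seed=0):
    """Partition each domain's questions into a small in-domain pool
    (usable for fine-tuning) and a held-out evaluation set, mirroring
    KaggleDBQA's 30%/70% protocol. A sketch only; the released splits
    were fixed by the dataset authors, not generated this way."""
    rng = random.Random(seed)
    split = {}
    for domain, questions in questions_by_domain.items():
        qs = list(questions)
        rng.shuffle(qs)
        k = round(len(qs) * in_domain_frac)
        split[domain] = {"fine_tune": qs[:k], "eval": qs[k:]}
    return split
```

Fixing the seed keeps the in-domain and evaluation portions stable across runs, which matters because all settings are evaluated on the same 70% portion.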
We report accuracy in both the few-shot and the standard zero-shot (cross-domain) settings in this paper, but consider the few-shot setting to be the primary evaluation setting for KaggleDBQA. Evaluation is conducted on the same 70% portion regardless of setting, to ensure comparable results.

Figure 2: Comparison of text-to-SQL datasets in terms of SQL structure hardness. KaggleDBQA has more complex SQL query structure than the Spider dev set.

Dataset Statistics and Comparison
We compare KaggleDBQA with previous benchmark datasets using key metrics in Table 1.
KaggleDBQA has the lowest value mention percentage of all the datasets, and also exhibits a low overlap between question terms and column names, comparable to every dataset besides Spider. This brings it closer to a real-world setting, in which the people asking questions are not familiar with the actual database schema and terminology. It is likely a result of replacing column names with descriptions in the question annotation task.
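For clarity, column mention percentage can be approximated as the fraction of questions that contain some column name of their database verbatim. The tokenization choices below (lowercasing, underscores read as spaces, word-boundary matching) are assumptions; the paper's exact procedure may differ:

```python
import re

def column_mention_percent(examples):
    """Percentage of (question, columns) pairs in which the question
    mentions at least one column name verbatim. Simplified sketch of
    the metric reported in Table 1."""
    hits = 0
    for question, columns in examples:
        q = question.lower()
        if any(re.search(r"\b" + re.escape(c.lower().replace("_", " ")) + r"\b", q)
               for c in columns):
            hits += 1
    return 100.0 * hits / len(examples)
```

Under this sketch, a question like "which state has the highest score" counts as a mention for a database with a `state` column, while "how many samples come from other countries" does not match a `country_code` column.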
We also analyze the overlap between question terms and column descriptions in Table 2. Because the descriptions are significantly longer than column names, we require only that they share an n-gram (ignoring stop-words) rather than requiring an exact match, as was done for column mention percent. Unigram overlap is reasonably high (56% of correct columns match the question) but also results in many false-positive matches with other columns. Increasing the n-gram size decreases false positives but also rapidly decreases the correct column match percent. Thus, column descriptions may help guide the model, but they are not as strong a signal as found in Spider, which suffers from high exact column-name overlap. This was our intention in asking our annotators to avoid using the descriptions verbatim when writing questions.
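The description-overlap analysis can be sketched as a content n-gram intersection test. The stop-word list and whitespace tokenizer here are placeholders, not the ones used for Table 2:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Placeholder stop-word list; the paper does not specify one.
STOP = {"the", "of", "a", "an", "in", "for", "to", "is"}

def shares_ngram(question, description, n=1):
    """True if the question and a column description share any content
    n-gram after stop-word removal. Sketch of the Table 2 analysis."""
    q = [t for t in question.lower().split() if t not in STOP]
    d = [t for t in description.lower().split() if t not in STOP]
    return bool(ngrams(q, n) & ngrams(d, n))
```

Raising `n` makes the test stricter, mirroring the trade-off described above: fewer false positives, but fewer correct-column matches as well.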
To measure the complexity of SQL in KaggleDBQA, we adopt the hardness criteria of Spider and report the numbers in Figure 2. The queries are on average more complex than Spider's, with significantly more hard and extra-hard ones.
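Spider's hardness buckets are computed from parsed SQL components. As a rough intuition for what pushes a query into the harder buckets, one can pattern-match surface features; this sketch is only illustrative and should not be used to reproduce any reported numbers:

```python
import re

def sql_hardness(sql: str) -> str:
    """Very rough approximation of Spider's easy/medium/hard/extra-hard
    buckets. The official criteria count components of the parsed SQL;
    this keyword-counting proxy is for intuition only."""
    s = sql.upper()
    score = 0
    score += s.count("JOIN")                 # multi-table queries
    score += s.count("(SELECT")              # nested subqueries
    score += len(re.findall(r"\b(GROUP BY|ORDER BY|HAVING)\b", s))
    score += len(re.findall(r"\b(INTERSECT|UNION|EXCEPT)\b", s))
    if score == 0:
        return "easy"
    if score == 1:
        return "medium"
    if score == 2:
        return "hard"
    return "extra hard"
```

By this proxy, KaggleDBQA's prevalence of joins, grouping, and nesting is what drives its queries toward the hard and extra-hard buckets in Figure 2.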

Baseline Results
We first evaluate KaggleDBQA using models that were developed for the Spider dataset.
EditSQL (Zhang et al., 2019): EditSQL (with BERT) is the highest-performing model on the Spider dataset that also provides an open-source implementation along with a downloadable trained model.[3] The model was built for edit-based multi-turn parsing tasks, but can also be used as a single-turn parser for Spider or KaggleDBQA. It employs a sequence-to-sequence model with a question-table co-attention encoder for schema encoding.

RAT-SQL (Wang et al., 2020b):
RAT-SQL (v3 + BERT) is the model with the highest accuracy on the Spider leaderboard that also provides an open-source implementation.[4,5] It adds string matching to the encoder through relation-aware self-attention and adopts a tree-based decoder to ensure the correctness of the generated SQL.
Throughout this paper, we use the same exact-match accuracy metric introduced by the Spider dataset. Although our primary evaluation setting is few-shot, we first examine the traditional zero-shot setting to present an unbiased comparison with previous results. Table 3 compares the performance of these two models (both trained on Spider). As can be seen, the performance of both models is significantly lower on KaggleDBQA. This echoes the findings of Suhr et al. (2020), who found that a model trained on Spider did not generalize well to other datasets. Also, KaggleDBQA has far fewer column mentions and much more complex SQL than Spider (see Table 1 and Figure 2).
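As a rough illustration of exact-match evaluation, the proxy below canonicalizes two queries and compares the resulting strings, replacing literals with a placeholder since the official Spider metric ignores values. Spider's real metric parses queries into component sets; this string-level stand-in is for intuition only:

```python
import re

def normalize(sql: str) -> str:
    """Crude canonicalization: lowercase, strip a trailing semicolon,
    replace string/number literals with a placeholder, collapse spaces.
    Not the official Spider implementation."""
    s = sql.strip().rstrip(";").lower()
    s = re.sub(r"'[^']*'|\"[^\"]*\"|\b\d+(\.\d+)?\b", "value", s)
    return re.sub(r"\s+", " ", s)

def exact_match(pred: str, gold: str) -> bool:
    """True if prediction and gold query agree up to literals."""
    return normalize(pred) == normalize(gold)
```

For example, `SELECT name FROM t WHERE age > 21` and `select name from t where age > 30;` match under this proxy, because the metric is value-insensitive.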
For all further experiments on KaggleDBQA that emulate real-world evaluation, we choose RAT-SQL as the best performing parser.

Moving to the Few-Shot Setting
To apply RAT-SQL to KaggleDBQA's few-shot setting, for each domain we create a model by fine-tuning on its 30% in-domain data (see Appendix A.3 for implementation details). This fine-tuning is always performed as the last step before evaluation.

[3] https://github.com/ryanzhumich/editsql
[4] As of one month before paper authoring. Current SOTA systems are also based on RAT-SQL and add less than 5% accuracy, and thus will likely behave similarly.
[5] https://github.com/microsoft/rat-sql
As Table 4 shows, fine-tuning on a small amount of in-domain data dramatically increases overall accuracy, from 13.56% to 17.96% (rows (a) and (e)). Although the few-shot setting is our primary setting, we also present results in the zero-shot setting to compare to previous work (Table 4, rows (e)-(h)). In the remainder of the paper, however, we focus on the few-shot setting.

Leveraging Database Documentation
The database schemas in KaggleDBQA are obscure, making the task difficult without leveraging the database documentation. We consider only the column descriptions, but other portions of the documentation may prove useful in future work. The best approach for incorporating column descriptions into a text-to-SQL model is model-specific. RAT-SQL makes use of relations between question tokens and schema terms to assist with schema linking. We extend the same functionality to column descriptions by appending each column's description to its name (separated by a period) and recomputing the matching relations. The concatenated column name is also presented to the transformer encoder for schema encoding.
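The input construction for this extension can be sketched as a small helper. The exact separator formatting and the RAT-SQL relation recomputation are internal details not spelled out here:

```python
def augment_column_name(name: str, description: str = "") -> str:
    """Append a column's description to its name, separated by a period,
    so that schema-linking string matches can fire on description terms
    as well as on the (often obscure) raw name. Sketch of the input
    construction only, not the RAT-SQL relation computation."""
    return f"{name} . {description}" if description else name

print(augment_column_name("t_fed_rev",
                          "total federal revenue through the state"))
```

The augmented string is what the transformer encoder sees for the column, so a question mentioning "federal revenue" can now overlap it lexically.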
Simply adding these descriptions results in a mismatch between the training set (Spider), which does not have descriptions, and the evaluation set (KaggleDBQA), which does. To alleviate this, we first augment the schemas in Spider with artificial descriptions: for a column c of a table t, the description for c is "the c of the t". We then retrain RAT-SQL on Spider with these artificial descriptions.
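Generating the artificial descriptions is a simple template fill. The template form "the &lt;column&gt; of the &lt;table&gt;" is reconstructed from context, and the underscore handling below is an assumption:

```python
def artificial_description(column: str, table: str) -> str:
    """Build a template description for a Spider column so that the
    parser sees description-augmented schemas during training. The
    underscore-to-space normalization is an assumed detail."""
    c = column.replace("_", " ")
    t = table.replace("_", " ")
    return f"the {c} of the {t}"

print(artificial_description("student_name", "students"))
```

Because these descriptions restate schema information verbatim, they close the train-test format mismatch without injecting any new domain knowledge, which motivates the adaptation step described next.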
Since the artificial descriptions simply restate information from the schema, the model may not learn to leverage them for any further schema-linking signal and may simply treat them as noise. Therefore, we also evaluate RAT-SQL after adapting it to the general domain of KaggleDBQA, training on in-domain examples pooled from the databases other than the target so that it observes real descriptions and the KaggleDBQA question distribution. As with the other few-shot results, the model is then fine-tuned on the few examples of target-domain data. Adaptation and fine-tuning are two separate training processes: adaptation adjusts the model to the real-world distribution, while fine-tuning adjusts it for in-domain knowledge. The most effective setting for a target database in our experiments is to conduct adaptation first, followed by fine-tuning.
Table 4 (row (d)) shows the results. Using column descriptions in the context of adaptation increases model accuracy from 17.96% to 26.77%. Ablations show that adaptation and descriptions each contribute approximately half of this gain (row (c)). Descriptions provide no benefit without adaptation (row (b)), likely due to the train-test mismatch between artificial descriptions and real ones. Without any artificial descriptions during training, accuracy drops even further, so they are critical to leveraging in-domain knowledge. Overall, incorporating in-domain data (i.e. a few-shot setting and database documentation) nearly doubles model accuracy on KaggleDBQA, from 13.56% to 26.77%.

Column Normalization
One of the major challenges in KaggleDBQA is that column names are often obscure or abbreviated. A natural question is whether this creates difficulty because the model struggles to understand the meaning of a column or because it leads to a low overlap between question and column terms. In an attempt to tease these factors apart, we created a normalized version of KaggleDBQA by replacing the obscure column names with normalized column names such as one might find in the Spider dataset. This was done manually using column descriptions to help clarify each column and without introducing any extra knowledge into the column names except for the expansion of abbreviations (e.g. t_fed_rev → total federal revenue).
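A mechanical version of this normalization would expand abbreviated name parts through a lookup table. The abbreviation dictionary below is hypothetical; the paper's normalization was performed manually, with the column descriptions used to disambiguate:

```python
# Hypothetical abbreviation table; the actual normalization in the
# paper was done by hand using the column descriptions.
ABBREVIATIONS = {"t": "total", "fed": "federal", "rev": "revenue"}

def normalize_column(name: str) -> str:
    """Expand abbreviated column-name parts, e.g. t_fed_rev ->
    'total federal revenue', adding no knowledge beyond the
    expansion of abbreviations."""
    return " ".join(ABBREVIATIONS.get(part, part)
                    for part in name.lower().split("_"))

print(normalize_column("t_fed_rev"))  # total federal revenue
```

Parts without a known expansion pass through unchanged, matching the constraint that normalization should not introduce extra knowledge into the column names.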
In Table 5 we give the results of evaluation on the normalized KaggleDBQA, following the same setup as Table 4. Normalization provides a significant boost in performance (row (c) vs. row (a)). The trend is similar to Table 4: without adaptation, models with descriptions are no better than those without (row (b) vs. row (a), row (d) vs. row (c)); after adaptation, the train-test mismatch is partly mitigated and performance improves (row (f) vs. row (e), row (h) vs. row (g)). Normalization and descriptions provide complementary knowledge augmentation, jointly improving accuracy by 5% (row (h) vs. row (e)), more than either alone. Normalization helps clarify the obscure column names of KaggleDBQA. However, other challenges, such as the low column mention percentage and in-domain schema conventions, still leave significant room for improvement. We provide the full experimental results on normalized tables in the Appendix.

Table 6: Examples where description-augmented ("desc.") models solve a question that unaugmented models ("no desc.") do not. Both models are adapted and fine-tuned. Both omit values, as per the official Spider metric.

Table 6 shows examples of improvements due to descriptions. First, column descriptions help the parser correctly identify columns to select. For instance, it chooses STAT_CAUSE_CODE over STAT_CAUSE_DESCR when asked for "the most common cause of the fire (code)". Second, they clarify necessary constraints. For instance, when asked "how many samples come from other countries?", the parser chooses the correct origin column rather than the superficially-matching country in the clause WHERE sampledata15.origin = "2".

Table 7 shows a distribution of error types in KaggleDBQA, based on 10 randomly-selected erroneous predictions for each domain. The error categories mostly follow Suhr et al. (2020), modulo (a) removing unobserved categories, (b) separating semantically equivalent predictions into their own "Equivalent" category, and (c) categorizing significant structural errors as "Understanding Errors". We also provide more characteristics of each database in Table 8, in an attempt to understand the difference in performance across databases. Our model performs worst on the databases with the most columns (Pesticide, Baseball, and Soccer). The only database with lower accuracy is MathScore, which has multiple tables and a relatively small fine-tuning set.

Error Analysis
The most common error types and their examples are summarized in Table 9. (i) The most common type is "Incorrect Final Column" (33.75%), illustrating the difficulty of schema linking in KaggleDBQA even with documentation and fine-tuning. (ii) 32.5% of the errors are "Missing Constraints": in KaggleDBQA questions, users sometimes imply rather than directly mention the desired constraint, e.g. "in preparation" for Status = "Under Construction". (iii) 31.25% of the errors are "Incorrect Constraint", e.g. failing to parse "highest" into the top-1 result in descending order. (iv) 15% of the errors are "Entity-column matching", e.g. aligning "Salford" to Location rather than LSOA. This illustrates the difficulty of value linking, which is partly mitigated by the value descriptions for categorical columns in the database documentation.

Conclusion & Future Work
KaggleDBQA provides two resources to facilitate real-world applications of text-to-SQL parsing. First, it encourages an evaluation regime that bridges the gap between academic and industrial settings, leveraging in-domain knowledge and a more realistic database distribution. We encourage adopting this regime for established text-to-SQL benchmarks. Second, it is a new dataset of more realistic databases and questions, presenting a challenge to state-of-the-art parsers. Despite the addition of domain knowledge in the form of database documentation, our baselines reach only 26.77% accuracy, struggling to generalize to harder questions. We hope that better use of documentation and new modeling and domain adaptation techniques will help further advance the state of the art. The KaggleDBQA dataset is available at https://aka.ms/KaggleDBQA.

Ethical Considerations
Dataset Collection The data collection process was pre-approved by an IRB. Each annotator agreed to a consent form before having access to the labeling task. Each annotator was rewarded with a $20 e-gift card for the approximately one hour of their time. The authors of this paper acted as the SQL annotators and received no additional compensation. The databases collected for KaggleDBQA were individually reviewed to ensure they were properly licensed for re-distribution. For other details of dataset construction, please refer to Section 3. Aside from email addresses, no personal information of annotators was collected during our study. Email addresses were not shared and were promptly deleted after compensation had been provided. The association between annotator and annotation was deleted before any analysis or distribution was conducted.
Language Distribution KaggleDBQA only includes question annotations and databases in English, thus evaluating multi-lingual text-to-SQL models on it will require translation. The set of annotators included both native and second-language speakers of English, all fluent.

Usage of DBQA Technology Our goal with KaggleDBQA is to encourage the development of DBQA that will work in real-world settings. The actual deployment of a text-to-SQL parser must be conducted with appropriate safeguards in place to ensure users understand that the answers may be incorrect, especially if those answers are to be used in decision making.

Implementation Details For adaptation and fine-tuning, we decrease the learning rate of the BERT parameters by 50 times, to 6e-8, to avoid overfitting. We keep the learning rate of the non-BERT parameters the same, at 7.44e-4. We also increase the dropout rate of the transformer layers from 0.1 to 0.3 to provide further regularization.
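The two learning rates above can be realized with per-parameter-group options in a standard deep-learning optimizer. The grouping below keys on a `bert` name prefix, which is an assumption about the model's parameter naming; the resulting list can be passed to, e.g., `torch.optim.Adam`:

```python
def optimizer_param_groups(named_params, bert_lr=6e-8, other_lr=7.44e-4):
    """Split model parameters into BERT and non-BERT groups with the
    per-group learning rates described above. Sketch of the grouping
    logic only; the 'bert' prefix is an assumed naming convention."""
    bert, other = [], []
    for name, param in named_params:
        (bert if name.startswith("bert") else other).append(param)
    return [
        {"params": bert, "lr": bert_lr},    # BERT lr reduced ~50x
        {"params": other, "lr": other_lr},  # non-BERT lr unchanged
    ]
```

Keeping the BERT encoder on a much smaller learning rate during adaptation and fine-tuning is a common guard against catastrophic overfitting on very few in-domain examples.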