Exploring Underexplored Limitations of Cross-Domain Text-to-SQL Generalization

Recently, there has been significant progress in studying neural networks for translating text descriptions into SQL queries under the zero-shot cross-domain setting. Despite achieving good performance on some public benchmarks, we observe that existing text-to-SQL models do not generalize when facing domain knowledge that does not frequently appear in the training data, which may render the worse prediction performance for unseen domains. In this work, we investigate the robustness of text-to-SQL models when the questions require rarely observed domain knowledge. In particular, we define five types of domain knowledge and introduce Spider-DK (DK is the abbreviation of domain knowledge), a human-curated dataset based on the Spider benchmark for text-to-SQL translation. NL questions in Spider-DK are selected from Spider, and we modify some samples by adding domain knowledge that reflects real-world question paraphrases. We demonstrate that the prediction accuracy dramatically drops on samples that require such domain knowledge, even if the domain knowledge appears in the training set, and the model provides the correct predictions for related training samples.

show that the generalization performance is much worse in more challenging scenarios. For example, Deng et al. (2021) investigate the cases when the explicit mentions of database columns are removed from the question. Similarly, (Gan et al., 2021) observe that the model accuracy dramatically drops by replacing schema-related words with some synonyms. On the other hand, Suhr et al. (2020) find that the generalization to other databases is much worse, due to the distribution shift of both questions and SQL queries. These papers introduce important challenges for improving the generalization performance, i.e., the model trained on a cross-domain text-to-SQL dataset (e.g., Spider (Yu et al., 2018)) does not generalize to a new external database. However, the performance degradation is somehow expected for the following reasons. First, removing the explicit mentions breaks the assumptions that make the schema linking effective. Second, SQL queries in other databases could come from a different distribution; e.g., according to the hardness criteria defined by Spider benchmark, over 40% Spider SQL queries are Medium hardness, but there are less than 10% Medium SQL queries in the GeoQuery dataset (Zelle and Mooney, 1996).
In this work, we demonstrate that the generalization performance could be poor even when both the NL questions and SQL queries follow the similar distribution to the training set. Specifically, we constructed Spider-DK, a challenging variant of the Spider development set, with the focus of evaluating the model understanding of domain knowledge. A domain means a certain type of application scenarios; for example, the Spider benchmark includes various distinct domains such as geography and university. Cross-domain text-to-SQL research aims to build a text-to-SQL model that can generate correct SQL queries and generalize to different domains. Therefore, one main challenge of crossdomain text-to-SQL generalization is to understand different knowledge required by different domains.

T1
SELECT Columns Mentioned by Omission NL Find the name of the teacher who ... SQL select firstname , lastname from ...

T2
Simple Inference Required NL ... order of their date of birth from old to young. SQL ... order by date_of_birth asc

T3
Synonyms Substitution in Cell Value Word NL List the state in the US ... SQL ... where billing_country = "USA" ...

T4
One Non-Cell Value Word Generate a Condition NL How many students got accepted after the tryout? SQL ... from tryout where decision="yes"

T5
Easy to Conflict with other Domains NL ... with max speed higher than 1000. SQL ... where max_speed > 1000 For example, the university domain usually needs the knowledge of different job titles and genders, while the geography domain emphasizes more on the knowledge of places instead of people. We show that the state-of-the-art models consistently fail in cases when specific domain knowledge is required for prediction, even if the domain knowledge is moderately mentioned in the training data, and the models accurately predict the corresponding training samples. Such discrepancy suggests that the models do not properly learn the domain knowledge in order to fit the training set, thus improving the model capability to capture the domain knowledge is an important direction towards achieving the cross-domain generalization for textto-SQL applications. To our knowledge, we are the first work investigating the text-to-SQL model capability of understanding the domain knowledge provided in the training set, and generalizing the knowledge to new problems.

Overview
We construct the Spider-DK benchmark by selecting samples from the Spider development set that require domain knowledge understanding, and we also manually modify some samples to incorporate domain knowledge. The purpose of building Spider-DK is to simulate the scenario where specific domain knowledge is involved in the users' utterance query. Domain knowledge is often used unnoticedly, which makes some domain knowledge unavoidable. For example, in the T5 of Table 1, the direct use of the max_speed column annotation raises a domain knowledge problem. We discuss the details of this problem later in Section 2.2.
Spider-DK contains 535 NL-SQL pairs drawn from the Spider development set, where 270 pairs are the same as the original Spider samples, while the rest 265 pairs are modified to incorporate the domain knowledge. We categorize the types of domain knowledge required in Spider-DK, which makes it easy for breakdown analysis. Spider-DK is smaller than the Spider development set, because not every domain or example can be easily modified to incorporate some domain knowledge. Besides, it is hard to evaluate the model generalization ability for domain knowledge if keeping too many original Spider examples that do not require domain knowledge.
In particular, the distribution of the SQL query hardness in Spider-DK is close to the original Spider, i.e., easy accounts for 20.6%, medium accounts for 41.8%, hard accounts for 14.8%, and extra hard accounts for 19.1% 2 . We define five types of domain knowledge in Table 1. In Spider-DK, T1 accounts for 28.7% of samples, T2 accounts for 24.5%, T3 accounts for 27.5%, T4 accounts for 8.3%, and T5 accounts for 12.5%.
We curate the Spider-DK by modifying only questions or both questions and SQL queries, as shown in Table 2. We carefully add the domain knowledge into the utterance to ensure that the new utterance follows the domain knowledge required by existing Spider samples and does not raise ambiguity. Most domain knowledge in Spider-DK is similar to that in the Spider training set. Compared to the evaluation sets in (Suhr et al., 2020), Spider-DK is easier and closer to the training data and focuses only on domain knowledge, and we provide more discussion below.

Domain Knowledge
Different SQL databases could require very different domain knowledge. As shown in (Suhr et al., 2020), the state-of-the-art models on Spider achieve much worse performance on earlier SQL benchmarks such as ATIS and GeoQuery (Iyer et al., 2017;Zelle and Mooney, 1996). However, we argue that the failure of generalization is expected to some extent, because without seeing in-domain examples, some domain knowl-Only Modify the NL Spider ... in the order of birth date. Spider-DK ... order of their birth date from old to young.

Modify both NL and SQL Spider
Compute the average age of dogs. select avg(age) from dogs Spider-DK Compute the average age of abandoned dogs.
select avg(age) from dogs where abandoned_y = 1 edge required by these datasets is even hard to infer for experienced programmers. For example, we asked five computer science graduate students to write the SQL query for the question how many major cities are there? in Geo-Query, but none of them gave the correct answer. This question requires the domain knowledge that major means population > 150000 , which is hard to infer without looking at GeoQuery training set. Therefore, while acquiring general-purpose domain knowledge is also important, we believe that the failure of generalization to questions requiring similar domain knowledge to the training set could be more problematic, which motivates our design of Spider-DK benchmark.
We study five types of domain knowledge (name them as T1 to T5) shown in Table 1. T1 requires the models to understand that the user queries two columns by an omitted expression.
T2 requires the models to infer the correct queries, e.g., if the T2 utterance in Table 1 modified from date of birth to age , the model should output desc not asc. Note that the Spider training set contains both date of birth and age along with old to young . T3 requires the models to recognize the cell value synonym substitution. Some synonym substitutions base on their adjective form, such as singer whose country is France and French singer . Although the number of T4 is the least in Spider-DK, it is not uncommon in the Spider training set. Unlike the GeoQuery major example mentioned above, T4 only includes the conditions whose column type is similar to boolean. For example, in Table 1 and 2, the column decision only contain yes and no, while abandoned_y only contain 1 and 0. Therefore, the key to solving T4 is whether the model can distinguish whether the column is a boolean-like type, but the difficulty is that the word varies in different domains.
Although T5 seems simple and does not seem to contain domain knowledge, the models that generate SQL structure and schema items separately are easy to mispredict in T5. A review (Gan et al., 2020) shows that most models follow the separate generation pattern, i.e., these models may use the same word twice in both generating schema items and SQL structure. Because, in other domain training data, the models learn to generate a max() function when the utterance contains a word max. Therefore, these models may use the word max twice to generate the max(max_speed) for T5 utterance instead of a simple max_speed.

Experimental Setup
We evaluate the previous state-of-the-art models on the Spider-DK and Spider (Yu et al., 2018) We evaluate open-source models that reach competitive performance on Spider: GNN (Bogin et al., 2019), IRNet (Guo et al., 2019), RAT-SQL  with and without BERT (Devlin et al., 2019), and RAT-SQL + GAP (Shi et al., 2020).We present their results of the 265 Spider-DK domain knowledge examples and analyze their performance in each knowledge type. Our evaluation is based on the exact match metric defined in the original Spider benchmark, which measures whether the predicted query without condition values as a whole is equivalent to the gold query. Table 3 presents the exact match accuracy of different models on Spider T , Spider D , and Spider-DK.  (Guo et al., 2019) 87.2% 53.8% 33.1% RAT-SQL  93.6% 61.1% 35.8% RAT-SQL + BERT  92.0% 73.3% 40.9% RAT-SQL + GAP (Shi et al., 2020) 98.4% 67.8% 44.1%  All models are trained on the original Spider training set. Compared to Spider D , the performance of all models has significantly dropped by about 20% to 30% on Spider-DK. Although the Spider-DK is designed based on the Spider T , whose exact match evaluation is pretty high, these models can not generalize to the Spider-DK well.

Main Results
In particular, although RAT-SQL + BERT achieves better performance on Spider D than RAT-SQL + GAP, RAT-SQL + GAP outperforms RAT-SQL + BERT on Spider-DK, indicating that GAP facilitates the model to grasp a better understanding of domain knowledge. Despite some improvement achieved by recent models, the results show that domain knowledge understanding is still a considerable gap toward the realization of cross-domain text-to-SQL generation.

Performance on Knowledge Type Splits
To better understand the performance facing the domain knowledge, we present the breakdown accuracies of different domain knowledge types in Table 4. RAT-SQL + GAP unsurprisingly achieves the best performance on all examples and outperforms other models from T2 to T5. However, IRNet surprisingly obtains an overall accuracy close to the RAT-SQL + BERT, because IRNet integrates a ConceptNet (Speer and Havasi, 2012) to recognize the country, state, and city synonyms, which can improve its accuracy in T3. The GNN and RAT-SQL perform relatively poorly on T3 because they do not have extra knowledge components such as ConceptNet. Besides, GNN trains its embeddings from scratch, and RAT-SQL uses GLOVE (Pennington et al., 2014)   than BERT in many scenarios. Although Concept-Net helps IRNet in T3, it is not a generalization method for solving other domain knowledge problems. However, even the best-performing T3 is still far from the accuracy in Spider D , which shows that there is still much room for improvement.  Table 1. There are three reasons why existing models can not perform well in the Spider-DK. The first reason is that some domain knowledge is not common enough in the training set. For example, in T2, the phrase from old to young appears more often with age, which trains the model to output a desc age order. The unbalance training data may lead the model to prefer outputting a desc order even its column is the date of birth .

Error Analysis
The second reason is that the model has insufficient generalization ability for similar problems.
Many training examples belong to the T3 and T4. However, these examples can not cover all cases. For example, the training data may not or rarely contain examples where the USA is substituted with the United States, but we expect the models can still handle these examples correctly. The third reason is that a word will be used twice to generate schema items and SQL structure as we discussed the T5 in Section 2.2.

Conclusion
We introduce Spider-DK, a human-curated dataset based on the Spider benchmark for evaluating the generalization of text-to-SQL models, with the focus of understanding the domain knowledge. We demonstrate that the performance of existing textto-SQL models drops dramatically on Spider-DK, even if the domain knowledge appears in the training set. Our evaluation indicates that the models do not always understand the underlying domain knowledge for prediction, thus we consider improving the model understanding of domain knowledge as an important direction to achieve the crossdomain generalization.