Exploring Schema Generalizability of Text-to-SQL



Introduction
Given the corresponding database, text-to-SQL (Yu et al., 2018) aims to convert a natural language (NL) utterance into a structured SQL program. Recently, many advanced text-to-SQL models, such as RAT-SQL (Wang et al., 2019) and LGESQL (Cao et al., 2021), have been proposed to tackle this task.
Although significant progress has been achieved in terms of ultimate accuracy, many researchers point out that the actual performance of current text-to-SQL systems is over-estimated. Suhr et al. (2020) observed a dramatic performance decline when evaluating a state-of-the-art model on unseen datasets. Gan et al. (2021a) discovered that current parsers are vulnerable to adversarial attacks built from synonyms of words in user questions. To explore generalizability, previous literature mainly focused on the semantic diversity of natural language. However, the topological features of the database schema are also important, yet they have been less investigated when studying generalizability in text-to-SQL tasks.
We name the ability to automatically adapt to different schemas schema generalizability. Different databases will lead to completely divergent SQL queries even given the same user question. For example, in Figure 1, the SQL queries become different when the query entity "singer" functions as a column, a table, or a specific cell value, depending on the ontology of the corresponding DS. Furthermore, although current cross-domain text-to-SQL datasets use different databases during training and evaluation, they are insufficient for evaluating the schema generalizability of text-to-SQL systems. In Section 3.1, we observe that models can predict the structure of SQL queries even without the corresponding database, which may result from the limited database structures of current datasets.
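Concretely, the divergence can be sketched as follows. The table and cell-value variants mirror the running example of Figure 1, while the column variant assumes a hypothetical flag column and is included only for illustration:

```python
# The same question yields divergent SQL under different schemas.
question = "How many singers do we have?"

sql_when_table = "SELECT count(*) FROM singer"          # "singer" is a table
sql_when_cell_value = ("SELECT count(*) FROM people "
                       "WHERE identity = 'singer'")     # cell value of `identity`
sql_when_column = ("SELECT count(*) FROM people "
                   "WHERE singer = 'T'")                # hypothetical flag column
```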
In this work, we focus on studying the schema generalizability of current SOTA text-to-SQL systems and provide an in-depth analysis. To avoid the aforementioned problems in existing datasets, we propose a data- and structure-driven framework to automatically synthesize new (DS, SQL) pairs given the same input question. The framework modifies the DS at a modest annotation cost and updates the SQL synchronously by altering the abstract syntax tree (AST). Inspired by the entity-relationship diagram (E-R Diagram) (Ling, 1985; Li and Chen, 2009), all transformations follow the entity relationships of the database to guarantee that the modifications are reasonable. We also compare the execution results of the new and original (DS, SQL) pairs to ensure the correctness of the SQL updates.
Our experiments demonstrate that all four strong text-to-SQL models (RATSQL (Wang et al., 2019), LGESQL (Cao et al., 2021), T5 (Raffel et al., 2020), and T5-PICARD (Scholak et al., 2021)) suffer from poor schema generalizability. On the adversarial set generated from the Spider (Yu et al., 2018) dev set, perturbing the database schema reduces EM accuracy from an average of 67% to 35%. Even the performance on the adversarial set generated from the Spider training set drops dramatically (-46 points EM). Furthermore, we observe that adversarial examples in which both the DS and the SQL change are much more challenging for text-to-SQL models than examples in which only the DS changes. Finally, we discuss the efficiency of additional training on adversarial examples (Jia and Liang, 2017). Experiment results show that the performance improvement mostly stems from the additional question-DS patterns introduced by the extra training examples.
The main contributions are as follows:
• We propose a data- and structure-driven framework which can automatically synthesize samples containing unseen (DS, SQL) patterns with minimal human labor. This framework and the corresponding synthesized data will be publicly available at https://github.com/Auracion/schema_generation_framework.
• By utilizing the plug-and-play framework, we synthesize a test suite and demonstrate the poor performance of existing text-to-SQL models regarding schema generalization.
• We analyze the reasons for the modest generalization under perturbations that synchronously change (DS, SQL) pairs and demonstrate that adversarial training is a possible way to inhibit the overfitting problem.

Background and Related Work
Structural Features in Text-to-SQL Tasks Modeling the structural information in a database and designing an efficient algorithm to decode structured output sequences are crucial in text-to-SQL. Several studies achieved remarkable progress using GNNs (Scarselli et al., 2008) to encode schema linking, which enhances the graph structure of the DS and the relationships between the DS and question tokens (Bogin et al., 2019; Lin et al., 2020; Chen et al., 2021; Hui et al., 2022; Wang et al., 2019; Cao et al., 2021). Another line of research focuses on the grammar structure of SQL.
1 Since most related studies report EM accuracy as their results, we additionally reproduce the experiments using EM accuracy as the metric and present the results in Table 2.

Suitable Evaluation Data
To evaluate the schema generalizability of text-to-SQL models, a test dataset with novel databases is crucial. However, current text-to-SQL datasets are not suitable because of their over-templated nature (Section 3.1). Therefore, we propose a data- and structure-driven generation framework to synthesize relevant data for assessing the generalization capability (Section 3.2).

Current Datasets are Undesirable
To verify that current text-to-SQL datasets are over-templated, we conduct a syntax role prediction experiment. As the example in Table 3 shows, the structural features of SQL can be represented using syntax roles. We show the details of all used syntax role labels in Appendix D. Syntax Role Prediction aims to predict which SQL syntax roles are mentioned in the query, including the SQL keywords, nested structures, and aggregation clauses. Consider the user question Q = (q_0, q_1, ..., q_|Q|), the given database schema D, and the corresponding SQL S = (s_0, s_1, ..., s_|S|). The set R = {r_0, r_1, ..., r_|R|} contains all the predefined syntax roles involved in S. We formulate the syntax role prediction task as modeling P(R | X), where X = Q if the database schema information is not used and X = (Q, D) otherwise. The metric used in this experiment is joint accuracy: a case is treated as correct if and only if all of its syntax roles are correctly predicted.
In this experiment, we compare performance with and without database schema information in the inputs. For the model that only uses the user question, we encode the input as h = BERT(Q). For the model that uses both the user question and the database schema information, we encode the input as h = RAT-Encoder(Q, D), where RAT-Encoder is the encoder of RAT-SQL (Wang et al., 2019). We calculate the probability of syntax role r_i being used as P(r_i | X) = sigmoid(v_i^T h), where v_i is a learnable parameter vector corresponding to syntax role r_i.
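The probe can be sketched as the following minimal PyTorch module, under the assumption that the encoder (BERT-base for Q alone, the RAT encoder for (Q, D)) returns a single pooled vector h; the class and helper names are ours, not the original implementation:

```python
import torch
import torch.nn as nn

class SyntaxRolePredictor(nn.Module):
    """Multi-label probe over the predefined syntax roles r_0 .. r_|R|."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_roles: int):
        super().__init__()
        self.encoder = encoder                 # BERT-base or RAT encoder
        # One learnable vector v_i per syntax role; logits are v_i^T h.
        self.role_scorer = nn.Linear(hidden_size, num_roles, bias=False)

    def forward(self, inputs) -> torch.Tensor:
        h = self.encoder(inputs)               # pooled representation of X
        return torch.sigmoid(self.role_scorer(h))  # P(r_i | X) per role

def joint_accuracy(probs: torch.Tensor, gold: torch.Tensor) -> float:
    """A case counts as correct only if every role is predicted correctly."""
    pred = (probs > 0.5).float()
    return (pred == gold).all(dim=-1).float().mean().item()
```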
As the results in Table 4 show, the models without DS information (column 4) achieve 79.31 joint accuracy on average. That is, most of the time the model can predict the approximate structure of the target SQL from the user question alone, even though the training and test databases do not overlap. Meanwhile, the models using database schema information (column 5) achieve 80.13 on average, an improvement of only 0.82. This small gap between using and not using the DS demonstrates that the DS information is of little help for predicting the SQL structure. Additionally, we find that this phenomenon is not limited to Spider-like datasets but also exists in other text-to-SQL datasets. Therefore, we suspect that current datasets are too templated to evaluate generalizability. To this end, we need to synthesize suitable evaluation data.

Evaluation Data Generation
To assess the structural generalization capability, we propose a data- and structure-driven generation framework to synthesize relevant data. The synthetic data in this paper are modified from Spider (Yu et al., 2018).2 For a given sample, we synthesize a new sample by altering the DS while keeping the question constant. In order to obtain a reasonable DS, we construct the entity-relationship graph of the given DS and apply graph-based transformations. Moreover, we synchronously update the SQL by modifying its abstract syntax tree. We show more details in Appendix A. In this work, we use four different transformations of the DS. Figure 2 illustrates an example of each transformation, and we give a brief introduction below (see also the sketch after this list):
• Entity to Attribute (E2A) merges two tables into one.
• Concept to Attribute (C2A) converts the concept3 of an entity, which is represented via the table name in the DS, into an attribute.
• Named to Unnamed (N2U) replaces the table corresponding to a relationship with foreign keys.
• Unnamed to Named (U2N) replaces a foreign key with a relationship table.
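To make the effect of a transformation concrete, the following sketch shows a hypothetical E2A instance on a toy schema. The table and column names are illustrative; the real framework operates on the E-R graph and the SQL AST (Appendix A):

```python
# Entity-to-Attribute (E2A): two tables are merged into one, and the SQL
# join over them collapses accordingly.
before_schema = {
    "people": ["id", "name"],
    "identity_card": ["people_id", "card_number"],  # hypothetical entity
}
before_sql = ("SELECT card_number FROM identity_card JOIN people "
              "ON identity_card.people_id = people.id WHERE name = 'Bob'")

after_schema = {"people": ["id", "name", "card_number"]}  # merged table
after_sql = "SELECT card_number FROM people WHERE name = 'Bob'"
```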
Table 5 shows the total number of samples of each kind synthesized via the different E-R transformations. We evaluate the synthesis quality by comparing the execution results of the original and synthetic (DS, SQL) pairs: on average, over 90.43% of generated samples keep consistent execution results. In this work, we only consider one-step transformations because of the textual noise that accumulates in automatic multi-step transformations.
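The consistency check can be sketched as follows, assuming the databases are available as SQLite files; this is one plausible implementation of the comparison rather than the exact tooling:

```python
import sqlite3
from collections import Counter

def results_consistent(orig_db: str, orig_sql: str,
                       new_db: str, new_sql: str) -> bool:
    """A synthetic (DS, SQL) pair counts as consistent when it executes to
    the same multiset of rows as the original pair (row order ignored)."""
    def run(db_path: str, sql: str) -> Counter:
        with sqlite3.connect(db_path) as conn:
            return Counter(conn.execute(sql).fetchall())
    try:
        return run(orig_db, orig_sql) == run(new_db, new_sql)
    except sqlite3.Error:
        return False  # an execution failure counts as inconsistent
```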

Generalization Evaluation
In this section, we conduct experiments to evaluate the practical generalization of current text-to-SQL models:

Experiment Setup
In this work, we experiment with two grammar-based SOTA text-to-SQL parsers, RATSQL (Wang et al., 2019) and LGESQL (Cao et al., 2021). Besides, we also experiment with the T5-based end-to-end text-to-SQL parser, decoding both with and without PICARD (Scholak et al., 2021). The evaluation metric we use to report results is exact set match accuracy (EM). Results are averaged over three trials to reduce variance.

Equivalent Test Set (ETS)
To precisely evaluate model robustness, we construct an equivalent test set for a given dataset, which contains the same number of samples. We restrict each sample in the original dataset to match exactly one synthetic variant in the ETS. If a sample cannot generate a variant, we add a duplicate of it to the ETS. In this work, the percentages of such duplicated samples in the ETS are 34.3% and 14.7% for the affected and unaffected settings, respectively. Furthermore, to reduce the influence of hardness,4 we utilize a heuristic algorithm to modulate the ETS so that its hardness distribution is close to that of the original dataset. We show more details of the algorithm in Appendix E.

Practical Schema Generalization
We construct the equivalent test set (ETS) for both the training set and the development set of Spider. The training data in this experiment is the Spider training set, and we compare performance on the Spider training set, the Spider development set, and their corresponding ETSs. The experiment results (Spider Train. vs. Spider Train. ETS) in Table 6 indicate that perturbations applied to the database schema disturb the parsing process: the models cannot precisely infer the shape of the SQL query when confronting novel DS structures, even though the questions and the remaining parts of the DS are the same as those seen in the training phase. When it comes to the development set and its corresponding ETS, the databases are completely novel because they do not overlap with the databases in the training set. Nevertheless, the experiment results (Spider Dev. vs. Spider Dev. ETS) in Table 6 also show a dramatic performance decline. These phenomena demonstrate that the practical schema generalization capability is likewise modest, similar to the structural robustness. Therefore, we suspect that current text-to-SQL parsers cannot automatically infer the SQL pattern from the DS. We discuss the underlying reason for this issue in the next section.

Discussion about Schema Generalizability
In this section, we discuss the schema generalizability of text-to-SQL models by answering the following questions:
• Q1: What is the actual function of the database schema input? (Section 5.1)
• Q2: What is the actual reason for the modest generalizability? (Section 5.2)

Function of Database Schema Input
To answer Q1, we verify that the database schema (DS) information is independent of the process of constructing SQL patterns. Reviewing the experiments in Section 4, models always make mistakes when facing out-of-dataset (OOD) DS. To estimate whether the OOD structure itself confuses the parsers, we consider evaluation data containing OOD DS while keeping the SQL query unchanged.
Setup: Different from the evaluation data used in Section 4, we generate data with a different DS but the same SQL. For each piece of data, the DS transformations are applied to untapped parts of the schema so that the SQL is not influenced. Similarly, we also construct the equivalent test set (ETS) for the Spider development set with this kind of synthetic data. The training data in this experiment is the Spider training set.
Results for the grammar-based decoders (lines 1-4) in Table 7 demonstrate that the OOD structure does not influence the inference process. Reviewing the syntax role prediction experiments discussed in Section 3.1, we suggest that current text-to-SQL models construct the SQL query from the sentence pattern of the user question rather than from the actual structure of the DS. We suspect that the function of the DS input is to provide the correct surface form of the SQL non-keywords (table names, column names, and values); the efficiency of schema linking provides a strong signal on the target database item. Once the explicit relationships between these SQL non-keywords and their mentions in the question are destroyed, models make mistakes in selecting the correct schema item, yet the SQL structure is still predicted correctly (Gan et al., 2021a). Results for the token-based decoders (lines 5-8) in Table 7 show a remarkable performance decline, which seems to contradict the previous conclusion. We analyzed the error cases and suggest that this issue is caused by unnatural schema item names, which we report in the Limitations section.

(NL, DS) Pattern
To answer Q2, we first introduce the concept of the (NL, DS) pattern, which represents the combination of a natural language (NL) role and a database schema (DS) role. We then illustrate how (NL, DS) patterns influence generalizability.

NL Role:
As the examples in Figure 3 show, we assume that some words (excluding stop words) in the NL question describe the key information of the query. In this section, we simply split these keywords into two categories,5 target and condition, which we call the NL roles of these words. A target represents the queried entity we focus on. For instance, in the first example, we attempt to obtain the number of "people", so "people" is a target in this case. A condition represents a constraint on the target. For example, the specific "people" we query are restricted by the condition "whose identity is singer"; therefore, the condition keywords are "identity" and "singer". The NL roles are DS-independent; in other words, they depend only on the semantics of the NL question.
DS Role: For a DS, some elements link to keywords in the given question, such as the word "singer" in the first case, and each of them plays a unique role in the given DS. We define the DS roles as table, column, and cell value, plus a padding role to link the non-schema-related keywords, for instance, the word "people" in the second case.
(NL, DS) Pattern: For each of these elements, we name the combination of its NL role and DS role an (NL, DS) pattern, which determines its syntax role in SQL. For example, the element "singer" in the first case functions as the NL role condition and the DS role cell value; in this case, it is located in the WHERE clause. However, when the DS role becomes table, as in the second case, the element "singer" is located in the FROM clause. For a given NL question and DS, the structure of the SQL query depends on the (NL, DS) patterns it contains.
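A minimal sketch of this mapping, covering only the two cases discussed above, is given below; the full mapping is richer and schema-dependent:

```python
# Partial sketch: how an (NL role, DS role) pattern determines where an
# element lands in the SQL query, following the paper's two examples.
PATTERN_TO_CLAUSE = {
    # "singer" as a cell value of the column `identity`:
    #   SELECT count(*) FROM people WHERE identity = 'singer'
    ("condition", "cell value"): "WHERE",
    # "singer" as a table of its own:
    #   SELECT count(*) FROM singer
    ("condition", "table"): "FROM",
}

def clause_for(nl_role: str, ds_role: str) -> str:
    """Return the SQL clause an element with this pattern lands in."""
    return PATTERN_TO_CLAUSE.get((nl_role, ds_role), "unknown")
```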
We hypothesize that the modest generalization capability is due to over-fitting of (NL, DS) patterns: unseen (NL, DS) patterns at evaluation time lead to failed parsing. To verify this, we first evaluate performance on the samples synthesized via the different E-R transformations. The experiment results are illustrated in Figure 4. We notice that models make mistakes on almost all samples generated using the C2A E-R transformation. In fact, C2A is a special transformation that necessarily creates the (NL, DS) pattern (target, cell value). In general, this pattern represents a condition in the WHERE clause. We then randomly sample 100 pieces of data from the training set and from the synthetic C2A data to check whether (target, cell value) is absent from the original dataset but appears in the synthetic dataset. Table 8 shows the statistics from manually annotating the distribution of (NL, DS) patterns. The combination (target, cell value) is not contained in the training set but appears as an unseen pattern in the C2A samples. We additionally show some typical error cases in Appendix C. The examples demonstrate that models tend to parse according to prior experience and therefore make mistakes on novel patterns. We thus suggest that the actual reason for the modest generalization capability is (NL, DS)-pattern-wise over-fitting.

Pattern-Specific Adversarial Training
In this section, we study whether adversarial training can improve structural generalization by evaluating the efficiency of training with extra synthetic data.
Setup: We conduct experiments on both the original and synthetic evaluation data. For adversarial training, we train models on the original Spider training set plus additional synthetic data at a 1:0.2 ratio. We consider two kinds of extra synthetic training data. The first kind contains a novel database schema (DS) and a different SQL query (compared with the original data); these are similar to the evaluation data used to build the ETS in Section 4, and we name them Affected. The second kind contains a novel DS but the same SQL query, similar to the data from the Dev. ETS in Section 5.1; we name them Unaffected. The synthetic evaluation data used in this section is Affected. We report our results in Table 9. The results in the upper block (lines 1-3) illustrate that neither affected nor unaffected extra training data improves performance on the original development set. The reason is that the extra training data do not provide (NL, DS) patterns that are rare in the original training set but appear in the original development set. Indeed, the over-templated nature of the datasets makes such patterns hard to find, and the transformations applied in this work do not guarantee that such patterns are created either.
The results in the lower block (lines 4-6) of Table 9 show that the affected extra training data helps improve performance on the synthetic evaluation data, whereas the unaffected data does not. The reason is that the former provides (NL, DS) patterns that are rare in the original training set but contained in the ETS, while the latter provides none of these patterns because its perturbations are applied to untapped parts of the DS. This experiment amplifies the benefit of adversarial training by increasing the overlap of (NL, DS) patterns between the extra training data and the synthetic evaluation data. Therefore, we suggest that adversarial training is a possible way to improve the structural generalization capability, even though the improvement cannot be verified on current datasets, and it needs more investigation in future work.

Not Only in Cross-Domain
The problem of (NL, DS)-pattern-wise overfitting is not specific to cross-domain text-to-SQL. Modest structural generalization is just one of its manifestations under the cross-domain setup; single-domain text-to-SQL suffers from the same problem.
From the viewpoint of (NL, DS) patterns, patterns missing from the training stage appear as unseen patterns at test time and cause the performance decline. However, omitting patterns is inevitable during data collection: annotators can neither ensure that questions are asked in all possible sentence patterns nor guarantee that all combinations of schema items are covered. For instance, as illustrated in Figure 1, when confronting the third DS, annotators may not come up with a question about "singer", or they may ask it as How many people whose identity is a singer?. Therefore, automatically addressing unseen patterns is also essential in single-domain text-to-SQL.

Conclusion
In this work, we first report that current text-to-SQL datasets are too templated to properly investigate generalization capability. To this end, we construct a generation framework to synthesize text-to-SQL data for evaluation. Experiment results illustrate that model generalization is poor, and our analysis shows that the problem is caused by over-fitting of (NL, DS) patterns. Finally, we demonstrate that adding extra training data that covers otherwise-unseen evaluation patterns improves performance; adversarial training is thus a possible way to enhance text-to-SQL parsers.

A Generation Framework
The overview of the generation framework is shown in Figure 5. For a given sample, we synthesize a new sample by altering the DS while keeping the question constant. In order to obtain a reasonable DS, we construct the entity-relationship graph of the given DS and apply graph-based transformations, which we introduce in Section A.1 and Section A.2, respectively. Moreover, we synchronously update the SQL by modifying its abstract syntax tree; we show more details in Section A.3.

A.1 Entity-Relationship Graph
A relational database organizes data in predefined relationships, which are represented as structural relationships among tables and columns. To clearly describe reasonable relationships, developers often use the Entity-Relationship Diagram (E-R Diagram) (Ling, 1985; Li and Chen, 2009) to define the relationships among the raw data, which helps design the database structure. Inspired by the E-R Diagram, we modify the DS following the entity relationships so that the rationality of the altered DS is ensured. To this end, we introduce the definition of the Entity-Relationship (E-R) Graph, which evolves from the E-R Diagram but leaves out the attribute vertexes to emphasize the topological features.6 A vertex in the E-R Graph represents an entity, and an edge represents the relationship between the entities its terminal vertexes correspond to. Both vertexes and edges correspond to tables in the DS. For example, as shown in Figure 5, each of the tables people, author, and novel corresponds to a vertex in the E-R Graph, and the table write corresponds to an edge. To construct the E-R Graph, we manually annotate a binary tag for each table in the DS to distinguish between entity and relationship. We label a table as a relationship following two principles and label the others as entities:
Bridge Structure: The given table should contain exactly two foreign keys.
Semantic Convention: The table name should be a combination of two entity names, such as the relationship Customer_Addresses combining Customer and Address. Apart from that, phrases that obey human language conventions are also considered, for instance, the relationship visit linking visitor and museum.
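A hypothetical automation of these two otherwise manual principles is sketched below; how the principles combine is a judgment call in practice, and the sketch simply conjoins them:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Table:
    name: str
    foreign_keys: List[str]   # columns referencing other tables

def looks_like_relationship(table: Table, entity_names: List[str]) -> bool:
    """Heuristic stand-in for the manual entity/relationship tag."""
    # Bridge structure: the table contains exactly two foreign keys.
    if len(table.foreign_keys) != 2:
        return False
    # Semantic convention: the name combines two entity names, e.g.
    # "Customer_Addresses" from "Customer" and "Address"; single verbs
    # such as "visit" are accepted as conventional relationship names.
    parts = table.name.lower().split("_")
    named = sum(any(p.startswith(e.lower()) for e in entity_names)
                for p in parts) >= 2
    return named or len(parts) == 1
```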

A.2 E-R Transformation
An E-R transformation is a graph transformation on the E-R graph. There are ten kinds of E-R transformations, covering five operations applied to vertexes or edges. We assume that databases storing the same data under different schemas can be transformed into each other via a sequence of E-R transformations. We illustrate all kinds of E-R transformations and the corresponding transformations in the DS in the Appendix. However, some transformations are insecure; for example, applying a delete edge transformation leads to information loss. Besides, some transformations rely on strict annotation criteria and costly manual labeling; for instance, whether a table can be split into two requires rigorous judgment based on the semantic environment. In this work, we use three E-R transformations that need no additional annotations, and together they correspond to four different transformations in the DS. Figure 2 illustrates an example of each transformation, and we give more details below.
Entity to Attribute (E2A) corresponds to a kind of merge vertexes E-R transformation: a pair of connected vertexes in the E-R graph is merged, so the two corresponding tables are merged into one.
Named to Unnamed (N2U) corresponds to a kind of modify edge E-R transformation. We call a relationship represented by a table Named, and one represented by foreign keys Unnamed. For instance, in the original DS illustrated in Figure 2, the table sing is a named relationship, and the foreign key aid in the table song represents an unnamed relationship. We change the type of relationship by creating a foreign key of one table in the other table, as the example shows.
Unnamed to Named (U2N) also corresponds to a kind of modify edge E-R transformation and is the reverse of Named to Unnamed. We create a relationship table, named with the combination of the two target table names, to store the relationship. Then we build the connection by transferring the foreign key into this table and creating another foreign key in it, as the example in Figure 2 shows.

A.3 AST Updating
To update the SQL precisely, we construct the AST of the given SQL following the grammar rules and alter the SQL by modifying the AST. For each E-R transformation, we detect the related subtrees in the AST and apply the corresponding rule to update them. For instance, we add an additional condition subtree to the corresponding WHERE subtree when applying the concept to attribute transformation. Finally, we render the altered SQL from the modified AST.
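A minimal sketch of the C2A case is shown below, assuming a generic labeled-tree representation of the AST rather than the actual grammar:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str                      # e.g. "WHERE", "condition", "="
    children: List["Node"] = field(default_factory=list)

def add_c2a_condition(where: Node, column: str, concept: str) -> None:
    """Attach `column = 'concept'` under the WHERE subtree, mirroring the
    extra condition introduced by concept-to-attribute transformation
    (e.g. identity = 'singer')."""
    cond = Node("condition",
                [Node(column), Node("="), Node(f"'{concept}'")])
    where.children.append(cond)
```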
In this work, we consider two types of synthetic data, affected and unaffected. Affected samples contain different SQL compared with the original data, while unaffected samples contain the same SQL. We distinguish the two types according to whether the SQL involves a DS element that is influenced by the transformation. The AST updating module is only used to synthesize affected data.

C Error Cases in Synthetic Evaluation Data
Examples in Figure 6, Figure 7, and Figure 8 illustrate that models tend to make predictions that follow familiar sketches.

Figure 1: Given the same question, the target SQL responds in different ways when the database schema is different.

Figure 2: Examples of the DS synthesized via different transformations. The dotted lines denote foreign keys (from foreign key to primary key).

Figure 3: Different combinations of NL role and DS role determine different SQL sketches.

Figure 4: Training without extra data, evaluating on different synthetic samples.

Figure 6: An example of failure prediction.

Figure 7: An example of failure prediction.

Figure 8: Another example of failure prediction.


Table 1 :
Setups of previous evaluation datasets and our synthetic samples. The synthetic evaluation data are modified from Spider. Unaffected and Affected are the two types of data we synthesize in this work; we introduce them in Section 6. The mark ✓ represents that the corresponding attribute is different from that in Spider; oppositely, the other mark notes that it is the same.

Table 3 :
The structure of SQL can be represented with the syntax roles.

Table 4 :
Experiment results of syntax role prediction. w/o DB Schema represents a vanilla model using BERT-base to encode user questions. w/ DB Schema represents the model using the RAT encoder to process both the user questions and the database schema.
2 Spider is the most popular cross-domain text-to-SQL dataset. It contains 8,659 training examples and 1,034 validation examples across 146 databases. The test set is unseen and contains 2,147 samples with 40 databases.

Table 5 :
Statistics of the generated data for the four transformations. Affected denotes samples whose SQL changes; Unaffected denotes samples whose SQL query does not change when the DS transformation is applied.

Table 6 :
EM accuracy on the Spider training set, the Spider development set, and their corresponding ETSs. Models are trained on the Spider training set. The backbones of RATSQL and LGESQL are BERT-base (Devlin et al., 2019); the size of T5 is also base.

Table 7 :
EM accuracy on the evaluation data synthesized from the Spider training set. The SQL in the synthetic data is consistent with the original.

Table 8 :
Distribution of (NL, DS) pattern in the training set and C2A samples.
Table 10 illustrates all kinds of E-R transformations and the corresponding transformations in the DS.

Table 10 :
All kinds of transformations and the necessity of labels. Considering the cost of manual annotation, we only choose three of them in this work.
Table 11 illustrates all the syntax role labels in the syntax role prediction experiment.

Hardness-Oriented Sampling Algorithm
Require: original examples x_1, x_2, ..., x_n
Ensure: synthesized samples y_1, y_2, ..., y_n
1: D_x ← get_hardness_distribution(x_1, x_2, ..., x_n)
2: Initialize the hardness distribution of the synthesized samples: D_y ← ∅
3: R ← ∅
4: for i = 1 to n do
5:   C_i ← find_same_hardness(X_i, x_i)
     ...
   while ... do
     Find the hardness category with the largest difference between the current distribution and the original distribution: h ← find_hard_cat(D_x, D_y)
     Sample an X_i from R which contains at least one variant with hardness h
22:  if no such X_i can be sampled then
     ...
   end while
26: return y_1, y_2, ..., y_n
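In spirit, the algorithm greedily picks, for each original example, a variant whose hardness reduces the gap between the synthetic and original hardness distributions. A minimal Python sketch of this idea follows; the helper semantics (hardness categories, candidate variants) are assumptions rather than the exact procedure:

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

def hardness_oriented_sampling(
    originals: Sequence,
    variants_of: Dict[int, List],
    hardness: Callable[[object], str],
) -> List:
    """Pick one variant per original so the synthetic hardness
    distribution tracks the original one."""
    target = Counter(hardness(x) for x in originals)   # D_x
    current: Counter = Counter()                       # D_y
    chosen: List = []
    for i, x in enumerate(originals):
        candidates = variants_of.get(i, [])
        # Prefer the variant whose hardness category has the largest
        # remaining deficit w.r.t. the original distribution.
        pick = max(
            candidates,
            key=lambda v: target[hardness(v)] - current[hardness(v)],
            default=x,  # duplicate the original when no variant exists
        )
        current[hardness(pick)] += 1
        chosen.append(pick)
    return chosen
```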