Text-to-SQL Error Correction with Language Models of Code

Despite recent progress in text-to-SQL parsing, current semantic parsers are still not accurate enough for practical use. In this paper, we investigate how to build automatic text-to-SQL error correction models. Noticing that token-level edits are out of context and sometimes ambiguous, we propose building clause-level edit models instead. Besides, while most language models of code are not specifically pre-trained for SQL, they know common data structures and their operations in programming languages such as Python. Thus, we propose a novel representation for SQL queries and their edits that adheres more closely to the pre-training corpora of language models of code. Our error correction model improves the exact set match accuracy of different parsers by 2.4–6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines.


Introduction
Text-to-SQL parsing is a classic semantic parsing task that finds wide applications (Zelle and Mooney, 1996; Tang and Mooney, 2000). Since the release of Spider (Yu et al., 2018), a cross-database text-to-SQL benchmark, many semantic parsers with decent performance have been developed (Lin et al., 2020; Wang et al., 2020; Deng et al., 2021; Rubin and Berant, 2021; Scholak et al., 2021). Nonetheless, state-of-the-art semantic parsers are still not accurate enough. As a result, their users need to constantly correct wrongly predicted SQL queries, which can be as time-consuming and error-prone as writing a SQL query from scratch (Jorgensen and Shepperd, 2007; Weiss et al., 2007). Therefore, in this paper, we study the problem of automatic text-to-SQL error correction to better assist users in querying complex databases.
We first highlight that it is essential to factor in the compositional substructures within SQL queries,¹ such as abstract syntax trees (Yin and Neubig, 2017; Guo et al., 2022) and data-flow graphs (Guo et al., 2021), instead of treating code snippets as string sequences. Compared to individual tokens, substructures (e.g., SQL clauses) include more context of the entire program and are more semantically meaningful. Consequently, edit patterns of such substructures are more intuitive for humans to understand and easier for language models to learn. Moreover, while the pre-training corpora for language models of code, such as CodeT5 (Wang et al., 2021), do not include many SQL queries based on their documentation, they naturally contain abundant examples of common data structures like dictionaries. Therefore, we hypothesize that transforming unfamiliar SQL queries into familiar data structures can help language models of code better perform structural editing of SQL queries.

¹ Our code and data are available at https://github.com/OSU-NLP-Group/Auto-SQL-Correction.
Based on these observations, we develop our error correction model and make two contributions. First, we propose considering SQL clauses instead of tokens as basic semantic units for editing. Using a context-free grammar, we can decompose a SQL query and identify its clauses by traversing its abstract syntax tree. Second, we propose a new representation of SQL queries and their edits that adheres more closely to common code pre-training corpora, including CodeSearchNet (Husain et al., 2020), and makes the structures of a SQL query more explicit. With a decomposed SQL query, we pair each clause with its SQL keyword and represent the entire query as a Python dictionary. Then, we format edits on a wrong SQL query as a program that modifies data of the query's corresponding dictionary. Unlike token-level edits in existing work (Zhang et al., 2023), such dictionary operations define all edits unambiguously and can be directly executed with a Python interpreter.
Our experiments show that: (1) our proposed representations work well with CodeT5; (2) simply changing token-level edits to clause-level edits can effectively improve the performance of our models; and (3) our method improves the exact set match accuracy of different parsers by 2.4–6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines.

Text-to-SQL Error Correction
Given a natural language utterance u, a database schema s, and a wrong SQL query q− produced by an existing parser, our goal is to develop an error correction model that predicts a sequence of edit actions e and the correct query q+. Following previous work (Zhang et al., 2023), we formulate our task as sequence-to-sequence generation:

p(y | x) = ∏_{t=1}^{|y|} p(y_t | x, y_{<t})    (1)

where x = [u; s; q−] is the concatenation of the given inputs and y = [e; q+] is the concatenation of all edit actions and the resulting correct query. In this section, we study different representations of SQL queries (Section 2.1) and edits (Section 2.2) to better leverage language models of code.
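As a concrete illustration, the serialization into x = [u; s; q−] and y = [e; q+] can be sketched as below. The separator token and the edit-program string are our own assumptions for illustration, not necessarily the paper's exact format.

```python
# Hypothetical serialization of one error-correction example into the
# sequence-to-sequence format x = [u; s; q-], y = [e; q+].
def build_input(utterance, schema, wrong_sql):
    """Concatenate utterance, schema, and wrong query into one source sequence."""
    return " | ".join([utterance, schema, wrong_sql])

def build_target(edits, correct_sql):
    """Concatenate edit actions and the corrected query into one target sequence."""
    return " | ".join(edits + [correct_sql])

x = build_input(
    "List the text of all tweets.",
    "tweets: id, text, user_id",
    "select id from tweets",
)
y = build_target(["sql['select'] = 'select tweets.text'"], "select text from tweets")
```

A sequence-to-sequence model is then trained to map x to y, so it emits the edit actions before the corrected query.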

Query Representation
We consider two representations for a predicted query: (1) the original SQL format and (2) our proposed PyDict (Python Dictionary) representation.
To prepare for editing, we disambiguate each SQL query following Rubin and Berant (2021), including lower-casing non-value tokens, resolving table references, and formatting punctuation. This preprocessing normalizes SQL queries predicted by different base parsers and the gold annotations into the same format. To build our PyDict representation, we parse a SQL query into its abstract syntax tree (AST) with Spider's context-free grammar. We use depth-first search to traverse through the AST, find any nested substructures, and construct the dictionary representation bottom-up. Table 1 shows the "SQL" and "PyDict" representations of a SQL query (more details in Appendix A).
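To make the decomposition concrete, here is a minimal sketch of splitting a flat, already-normalized query into clause-keyed dictionary entries. It naively splits on top-level SQL keywords rather than parsing an AST with Spider's context-free grammar as the actual pipeline does, and it ignores nested subqueries.

```python
import re

# Toy clause-level decomposition: pair each clause with its SQL keyword,
# as in the PyDict representation. Real queries require AST parsing; this
# sketch breaks on subqueries or keywords inside string values.
CLAUSE_KEYWORDS = ["select", "from", "where", "group by", "having", "order by", "limit"]

def sql_to_pydict(query):
    pattern = r"\b(" + "|".join(re.escape(k) for k in CLAUSE_KEYWORDS) + r")\b"
    parts = [p.strip() for p in re.split(pattern, query.strip()) if p.strip()]
    result, key = {}, None
    for part in parts:
        if part in CLAUSE_KEYWORDS:
            key = part
            result[key] = part          # keep the keyword inside the clause text
        else:
            result[key] = (result[key] + " " + part).strip()
    return result

sql = sql_to_pydict("select tweets.text from tweets where tweets.text like '%intern%'")
# {'select': 'select tweets.text', 'from': 'from tweets',
#  'where': "where tweets.text like '%intern%'"}
```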

Edit Representation
We first follow Zhang et al. (2023) in using a token-level edit representation with special tokens (Table 1), which have unique entries in the tokenizer and the model's embedding layer to describe Replace, Insert, and Delete edit actions (more examples in Appendix F). However, we find this representation can sometimes be ambiguous. As shown in Table 1, the span "tweets.text" appears twice in the SQL query. This repetition would confuse the error correction model about which span to replace when generating the corrected query. The ambiguity also makes it difficult to implement rules that directly carry out the edit actions on the wrong query. Hence, we change the token-level edit representation to clause-level, which includes more context of the query and makes different edits more distinguishable. In our experiments (Section 4.1), we demonstrate that this simple modification is already effective. Our program representation further improves performance because it is more similar to the code pre-training corpora and eliminates the need to learn special tokens' representations.
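The key property of the program representation, that edits are ordinary dictionary operations executable by a Python interpreter, can be demonstrated with a toy sketch. The exact edit-program syntax here is an assumption for illustration.

```python
# A wrong query in a PyDict-style representation.
sql = {
    "select": "select tweets.text",
    "from": "from tweets",
    "where": "where tweets.text = 'value'",
}

# A model-generated edit program (illustrative): replace the where clause.
# Because the edit names the clause key, it is unambiguous even when the
# same token span appears multiple times in the query.
edit_program = "sql['where'] = \"where tweets.text like '%intern%'\""
exec(edit_program)  # dictionary edits run directly in the interpreter

# Re-serialize the corrected query from the edited dictionary.
corrected = " ".join(sql[k] for k in ["select", "from", "where"])
```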
Experimental Setup

Data Synthesis for SQL Error Correction
To train a text-to-SQL error correction model, we need to collect a set of wrong SQL parses that reflects a realistic distribution of errors (Section 4.2) as our training data. We synthesize this dataset by performing 5-fold cross-validation on each parser, which approximates the actual evaluation setting. Following the evaluation setup in Yu et al. (2018), we split Spider's training set into five roughly equal subsets by different databases. For each cross-validation fold, we train a text-to-SQL parser (Section 3.2) on four subsets and evaluate it on the remaining one. At inference time, we perform beam search with size 20 for each example and collect grammatical and executable parses in the beam. If a SQL parse is not an exact set match or execution match to the gold annotation, we label it wrong and include it in our training set for error correction. Having synthesized our training dataset, we randomly sample 8 databases and their associated questions to construct a held-out development set. For development set examples, we only keep incorrect SQL parses with the highest beam confidence. For our error correction test set, we train each parser on the full Spider training set and evaluate it on the original Spider development set without modifications. We similarly keep SQL parses with exact match or execution match errors. Table 2 summarizes the statistics of our data.
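A minimal sketch of the database-level split used for cross-validation might look as follows; the record format and field names are hypothetical, and the point is that whole databases, not individual examples, are assigned to folds so each fold evaluates on unseen databases.

```python
import random
from collections import defaultdict

def split_by_database(examples, n_folds=5, seed=0):
    """Distribute databases round-robin across folds so no database spans folds."""
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex["db_id"]].append(ex)
    dbs = sorted(by_db)
    random.Random(seed).shuffle(dbs)
    folds = [[] for _ in range(n_folds)]
    for i, db in enumerate(dbs):
        folds[i % n_folds].extend(by_db[db])
    return folds

# Toy data: 10 databases with 5 questions each.
examples = [{"db_id": f"db{i % 10}", "question": f"q{i}"} for i in range(50)]
folds = split_by_database(examples)
```

Each fold would then serve once as the evaluation split while a parser is trained on the other four, and its wrong parses are harvested as error-correction training data.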

Models
Text-to-SQL base parsers. We choose three text-to-SQL parsers with different decoding strategies and levels of performance (Table 3). We elaborate on our selection criteria in Appendix B.
• BRIDGEv2 (Lin et al., 2020): A parser with left-to-right decoding constrained by SQL syntax (Appendix B).
• SmBoP (Rubin and Berant, 2021): A parser with bottom-up decoding; the strongest of the three.
• CodeT5 (Wang et al., 2021): We fine-tune CodeT5-base following Xie et al. (2022). This parser represents those using beam search decoding and having a lower accuracy.
Error correction models. We use two language models of code in all our experiments:
• CoditT5 (Zhang et al., 2023): A language model pre-trained for code editing tasks by injecting noise into code snippets in CodeSearchNet (Husain et al., 2020) and then denoising with token-level edit representations.
• CodeT5 (Wang et al., 2021): A language model pre-trained for general code understanding and generation with four different pre-training objectives.
We compare the existing SQL+Token-Level representation with our proposed ones: SQL+Clause-Level, PyDict+Clause-Level, and PyDict+Program on CodeT5, and the first three on CoditT5. Implementation details are in Appendix C.

Evaluation
We use the increase in Exact Set Match (EM) and Execution Match (EX) accuracy on our error correction test set to measure each model's performance. Because CoditT5's experiments assume the input program has at least one error, we keep this assumption for fair comparisons. To construct a test set satisfying this assumption, we have to compare parser-generated SQL queries with gold annotations (Section 3.1). Thus, we use the Spider development set as our test set and split the Spider training set to build a held-out development set (Table 2) to select model checkpoints during training. We also include results on our held-out development set in the appendix (Table E.1).

Main Results
We summarize our main results in this section. To ensure robustness, we repeat all experiments with 3 different random seeds and report the average performances with standard deviations. Our model can also be used in an interactive framework that allows users to select edit actions from the top-k beam candidates. We include more experiments with simulated user interactions in Appendix E.
Our representation's perplexity is the smallest. We validate that our PyDict+Program representation adheres more closely to the code pre-training corpora by measuring its zero-shot perplexity on CodeT5 using our development set (Section 3.1). As shown in Figure 1, by representing data in PyDict, we reduce the perplexity of CodeT5 by 2 orders of magnitude. After augmenting it with our program representation, we further reduce the zero-shot perplexity of CodeT5 to only 5.96 × 10², 3 orders of magnitude less than the SQL+Token-Level representation (1.26 × 10⁵).
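For reference, zero-shot perplexity here is the exponential of the average per-token negative log-likelihood under the language model. A small sketch with hypothetical token probabilities:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical stand-ins for model outputs: a representation the model
# finds familiar receives higher per-token probability, hence lower
# perplexity, which is the comparison made in Figure 1.
familiar = [math.log(0.5)] * 8      # e.g., PyDict+Program-style tokens
unfamiliar = [math.log(0.05)] * 8   # e.g., sequences with unseen special tokens
```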
Clause-level editing is more effective, especially when represented in PyDict+Program. Since CodeT5 consistently outperforms CoditT5 with the same representations, we focus on comparisons among CodeT5 variations. Across all three base parsers, our clause-level representations outperform SQL+Token-Level, and PyDict+Program yields the largest gains.

Error Analysis
Additionally, we conduct an error analysis (Table 4) by sampling 100 wrong parses from all three parsers and classifying them into five categories:
• Database Grounding: A generated SQL query has the correct structure, but some table/column names or entity values are wrong.
• Incorrect Structure: A generated SQL query has missing, wrong, or redundant structures.
• Syntax & Grammar: A generated SQL query violates the programming language's syntax.
• False Negative: A generated SQL query is semantically correct but not captured by evaluation metrics, or the gold annotation is wrong.
• Other: All other errors, such as wrong aggregation functions, besides the above categories.
Since the error distributions for each parser are similar, we discuss our findings based on the strongest parser, SmBoP, as an example.
Database grounding is the major type of error. Among the 100 samples from SmBoP, we find that 54 of them have database grounding errors, where SmBoP predicts wrong table/column names or entity values. Our model corrects 16 of them, since database grounding is also a major category of errors in our synthesized training set, so our model has learned to resolve similar errors. Nevertheless, it still cannot correct the remaining 38 SQL parses. We notice that our current representation for database schema is missing critical information, such as column data types and foreign key relations, for our error correction model to fix database grounding errors. Following our PyDict representation for SQL, we suggest designing a code representation for database schema that includes such information to tackle this issue in future work.
Structural errors are hard to edit automatically. Besides database grounding, 26 of SmBoP's errors belong to another category, incorrect structure. These 26 samples contain 7 parses with incorrect SQL clauses and 19 parses with incorrect subqueries, but our CodeT5-PyDict+Program model only resolves 1 and 2 of them, respectively. We find that correcting such errors usually involves multiple edit steps, which motivates us to incorporate our model into an interactive framework in future work. As our experiments with simulated user interaction (Appendix E.2) show, when our model interacts with the simulated user to correct one clause at a time, it is able to fully correct more SQL parses. Thus, we deem interactive correction would maximize our model's utility in practice.

Related Work
Since the release of CodeBERT (Feng et al., 2020), many language models of code have emerged for program understanding and generation (Ahmad et al., 2021; Chen et al., 2021; Guo et al., 2021; Wang et al., 2021; Guo et al., 2022; Fried et al., 2023; Nijkamp et al., 2023). In addition to program-related tasks, recent work shows they also excel at processing natural language structures. Using code as meaning representations (MRs), we can leverage language models of code in various tasks, such as commonsense reasoning (Madaan et al., 2022), action planning (Singh et al., 2022), and event extraction. In fact, how to design MRs to reduce model learning difficulty is a salient research question in semantic parsing (Guo et al., 2019; Gan et al., 2021b; Nie et al., 2022). Our work demonstrates that program-related tasks themselves can also benefit from code-based MRs. Specifically, we apply such MRs to SQL error correction, a variant of automatic program repair tasks (Tufano et al., 2019; Panthaplackel et al., 2022; Zhang et al., 2023). Although SQL is a code-based MR, it is much harder for models to learn compared to other MRs, such as FunQL and lambda calculus. Consequently, without many SQL queries in their pre-training corpora, language models of code can underperform state-of-the-art text-to-SQL parsers. By converting SQL queries into Python dictionaries, we can explicitly represent their compositional substructures and define edit actions as programs, which reduces the learning difficulty for language models of code and yields better performance.

Conclusion and Future Work
This paper presents a study on developing a text-to-SQL error correction model with clause-level edits and different representations. Our comprehensive experiments demonstrate that clauses are better semantic units than tokens for editing SQL queries and mimicking patterns in code pre-training corpora helps better leverage language models of code. As a future direction, we plan to incorporate our model into interactive semantic parsing frameworks (Li et al., 2020; Yao et al., 2019, 2020; Zeng et al., 2020) by suggesting possible edits to users once a wrong parse is identified. In this way, users would more efficiently correct parse errors and get better assistance. We also plan to experiment with other language models of code (Fried et al., 2023; Nijkamp et al., 2023) and text-to-SQL datasets (Zelle and Mooney, 1996; Gan et al., 2021a) to verify the generalizability of our method.

Limitations
Actual applications of our model. Our work assumes that input SQL queries to our model are always wrong. This assumption is more feasible in an interactive semantic parsing framework, where the users are expected to decide whether a SQL parse, accompanied by its natural language explanations (Elgohary et al., 2020, 2021; Narechania et al., 2021; Mo et al., 2022), has errors or not. Alternatively, to remove this assumption, it would be interesting for future work to study the performance of our error correction model in combination with an automatic error detection model (Chen et al., 2023).
Experiments with more language models of code. We have only experimented with two language models of code, CoditT5 and CodeT5, both using T5-base (Raffel et al., 2020) as their underlying model architecture. It would be interesting to test how our conclusions generalize to other language models of code in the future. Based on the strong capabilities of large language models of code, such as Codex (Chen et al., 2021), InCoder (Fried et al., 2023), and CodeGen (Nijkamp et al., 2023), we believe that these models can better exploit their knowledge about data structures and their operations in Python. These models may perform even better on text-to-SQL error correction with our proposed representations.

Acknowledgements
We would like to thank the anonymous reviewers and colleagues from the OSU NLP group for their thoughtful comments. This research was supported in part by a sponsored award from Cisco Research, NSF IIS-1815674, NSF CAREER #1942980, NSF OAC-2112606, and Ohio Supercomputer Center (Center, 1987). The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein. Ziru is also supported by The Ohio State University Graduate School through University Fellowship.

A SQL PyDict Representation
We implement the transformation from any SQL query to our PyDict representation in three steps (Section 2.1). First, we use context-free grammar to parse a SQL query and obtain its abstract syntax tree (AST). The AST naturally contains a SQL decomposition where each clause has its unique subtree. In addition, if a clause contains a nested query, it would be represented as another independent subtree, which is a child of the root node in the clause's AST subtree. With these substructures explicitly represented, we use depth-first search to traverse through the AST to build our PyDict representation bottom-up. In other words, if a clause contains a subquery, we process the subquery tree as an independent SQL AST and build a dictionary for it. Then, we combine it with other substructures of the clause with different dictionary keys. For example, in Table F.1, we first build the dictionary for "subquery0" and assign this identifier as the key. In the main "clause," we replace the subquery's corresponding span with this identifier. Finally, we use another dictionary to wrap the main "clause" and "subquery0" together as the final representation of the "where" clause. We repeat this procedure for each clause to incrementally add (key, value) pairs to the dictionary and "store" it to the variable sql, which we refer to in program edit representations.
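The subquery-lifting step can be sketched as follows. The <NESTED> placeholder and the input format are simplifications of the AST subtrees the actual implementation traverses; the "subqueryN" naming follows the paper.

```python
def lift_subquery(clause_text, nested, index=0):
    """Replace a nested query with a 'subqueryN' identifier and wrap both
    into one dictionary, mirroring the bottom-up construction described
    above. `nested` is the already-built dictionary for the subquery."""
    name = f"subquery{index}"
    return {
        "clause": clause_text.replace("<NESTED>", name),
        name: nested,
    }

# The subquery's dictionary is built first (bottom-up), then combined
# with the main clause under distinct keys.
where = lift_subquery(
    "where tweets.uid in <NESTED>",
    {"select": "select users.uid", "from": "from users"},
)
# {'clause': 'where tweets.uid in subquery0',
#  'subquery0': {'select': 'select users.uid', 'from': 'from users'}}
```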

B Text-to-SQL Parser Selection
We choose existing text-to-SQL parsers in our experiments according to two principles: the parsers predict database entity values, and they cover different decoding strategies, including grammar-based (BRIDGEv2), bottom-up (SmBoP), and token-based (CodeT5) decoding. We did not include parsers using top-down decoders because they usually cannot predict entity values in conditional statements, such as RAT-SQL (Wang et al., 2020). Instead, we include BRIDGEv2 because its decoding method mimics the left-to-right CFG derivation of a program, and it uses SQL syntax-based constraints to prevent grammatical errors. In recent work, such decoders, also used in PICARD (Scholak et al., 2021), are more popular than top-down decoders.

C Implementation Details
Our models (Section 3.2) are implemented in PyTorch (Paszke et al., 2019) using Huggingface Transformers (Wolf et al., 2020) and trained on a single NVIDIA RTX A6000 GPU (48GB memory).

D Statistical Significance Test
To demonstrate the effectiveness of our three clause-level edit representations (Section 4.1), we perform McNemar's Test (McNemar, 1947) to measure the statistical significance of their results in comparison to CodeT5-SQL+Token-Level. For each significance test between two models, we use the median results among our three runs to calculate the comparison matrix. Then, we compute the p-values using statsmodels. When p < 0.05, we reject the null hypothesis. In other words, we consider the accuracy improvement statistically significant when p < 0.05.
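For illustration, the continuity-corrected chi-square version of McNemar's test can be computed with the standard library alone; the paper uses the statsmodels implementation, and the counts below are hypothetical. Here b and c are the off-diagonal entries of the comparison matrix: the number of examples exactly one of the two models gets right.

```python
import math

def mcnemar_test(b, c):
    """McNemar's test with continuity correction.
    b, c: off-diagonal disagreement counts between two models."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-squared with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

stat, p = mcnemar_test(b=15, c=5)
significant = p < 0.05  # reject the null hypothesis at the 0.05 level
```

Note that statsmodels also offers an exact (binomial) variant, which is preferable when b + c is small.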

E Additional Results
Results on our development set. We report model performances on our held-out development set (Section 3.1) in Table E.1.

Results for simulated interaction experiments.
To show the potential of using our model in an interactive framework, we extend our main experiments (Section 4.1) by adding simulated user interactions. Since our model uses beam search to decode the edit actions e = {e_1, e_2, ..., e_n} and the resulting correct SQL query q+ (Equation 1), we simulate user interactions to select one edit action e_i at a time from the beam results. At each time step t, we prompt the decoder with previously selected edit actions e_1, ..., e_{t−1} to complete the sequence e_t, ..., e_n, q+ using beam search with size 3. Then, we use gold SQL annotations to simulate the user interaction, which selects an edit action e_t from the three candidates at step t or chooses to skip the current step when all three candidates are wrong. If skipping, the user continues to check the subsequent edit actions e_{t+j} (j = 1, 2, ..., n − t) until it selects the next edit action. When the interaction finishes, we append the selected edit action to the prompt and let the model regenerate a completion with the new prompt for the next step's interaction. Having simulated interactions for all edit actions, we do not use the generated q+ directly because some edit actions are skipped. Instead, we execute the selected ones on the initial SQL query to derive the final query.
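The simulated user's selection loop can be sketched as below. The pre-baked candidate lists stand in for beam search at each step, and matching against gold edits stands in for the full regeneration procedure described above; the edit-program strings are illustrative.

```python
def simulate_interaction(candidates_per_step, gold_edits):
    """At each step, the simulated user picks the candidate edit that
    matches a gold edit, or skips the step when none match."""
    selected = []
    gold = set(gold_edits)
    for candidates in candidates_per_step:
        match = next((c for c in candidates if c in gold), None)
        if match is not None:   # user selects a correct edit action
            selected.append(match)
        # otherwise the user skips this step
    return selected

steps = [
    ["sql['select'] = 'select tweets.text'", "sql['select'] = 'select *'"],
    ["del sql['limit']", "sql['limit'] = 'limit 5'"],
    ["sql['where'] = 'where 1'", "sql['from'] = 'from users'"],  # all wrong: skipped
]
gold = ["sql['select'] = 'select tweets.text'", "del sql['limit']"]
chosen = simulate_interaction(steps, gold)
```

The selected edits, rather than the model's final decoded query, are then executed on the initial SQL query to derive the final result.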
As shown in Table E.2, when collaborating with a simulated user, our error correction model can further improve the base parsers' accuracy. Compared to its performance without using any interactions, our model achieves up to 4.1 point more absolute improvement on EM accuracy (72.5 → 76.6; BRIDGEv2) and 5.0 point more absolute improvement on EX accuracy (73.1 → 78.1; BRIDGEv2). With these results for simulated interaction experiments, we deem that incorporating our error correction model into an interactive framework is a promising future direction.