Towards Robustness of Text-to-SQL Models against Synonym Substitution

Recently, there has been significant progress in studying neural networks for translating text descriptions into SQL queries. Despite achieving good performance on some public benchmarks, existing text-to-SQL models typically rely on lexical matching between words in natural language (NL) questions and tokens in table schemas, which may render the models vulnerable to attacks that break the schema linking mechanism. In this work, we investigate the robustness of text-to-SQL models to synonym substitution. In particular, we introduce Spider-Syn, a human-curated dataset based on the Spider benchmark for text-to-SQL translation. NL questions in Spider-Syn are modified from Spider by replacing their schema-related words with manually selected synonyms that reflect real-world question paraphrases. We observe that accuracy drops dramatically once such explicit correspondence between NL questions and table schemas is eliminated, even though the synonyms are not adversarially selected to conduct worst-case attacks. Finally, we present two categories of approaches to improve model robustness. The first category utilizes additional synonym annotations for table schemas by modifying the model input, while the second category is based on adversarial training. We demonstrate that both categories of approaches significantly outperform their counterparts without the defense, and that the first category is more effective.


Introduction
Neural networks have become the de facto approach for various natural language processing tasks, including text-to-SQL translation. Various benchmarks have been proposed for this task, including earlier small-scale single-domain datasets such as ATIS and GeoQuery (Yaghmazadeh et al., 2017; Iyer et al., 2017; Zelle and Mooney, 1996), and recent large-scale cross-domain datasets such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b). While WikiSQL only contains simple SQL queries executed on single tables, Spider covers more complex SQL structures, e.g., joins of multiple tables and nested queries. State-of-the-art models have achieved impressive performance on text-to-SQL tasks, e.g., around 70% accuracy on the Spider test set, even when tested on databases unseen during training. However, we suspect that such cross-domain generalization heavily relies on exact lexical matching between the NL question and the table schema. As shown in Figure 1, the names of tables and columns in the SQL query are explicitly stated in the NL question. Such questions constitute the majority of cross-domain text-to-SQL benchmarks, including both Spider and WikiSQL.

Figure 1: Sample Spider questions that include the same tokens as the table schema annotations; such questions constitute the majority of the Spider benchmark. In our Spider-Syn benchmark, we replace some schema words in the NL question with their synonyms, without changing the SQL query to synthesize. (Recovered examples: Spider: "What is the type of the document named 'David CV'?" / Spider-Syn: "What is the type of the file named 'David CV'?", SQL: SELECT document_type FROM documents ...; Spider: "What is the average horsepower for all cars produced before 1980?" / Spider-Syn: "What is the average power for all automobiles produced before 1980?", SQL: SELECT avg(horsepower) FROM CARS_DATA ...)

Footnote 1: Following prior work on adversarial learning, worst-case adversarial attacks mean adversarial examples generated by attacking specific models.
Footnote 2: Our code and dataset are available at https://github.com/ygan/Spider-Syn
Although assuming exact lexical matching is a good starting point for solving the text-to-SQL problem, this assumption usually does not hold in real-world scenarios. Specifically, it requires users to have precise knowledge of the table schemas to be included in the SQL query, which can be tedious when synthesizing complex SQL queries.
In this work, we investigate whether state-of-the-art text-to-SQL models preserve good prediction performance without the assumption of exact lexical matching, i.e., when NL questions use synonyms to refer to tables or columns in SQL queries. We call such NL questions synonym substitution questions. Although some existing approaches can automatically generate synonym substitution examples, these examples may deviate from real-world scenarios, e.g., they may not follow common human writing styles, or may even accidentally become inconsistent with the annotated SQL query. To provide a reliable benchmark for evaluating model performance on synonym substitution questions, we introduce Spider-Syn, a human-curated dataset constructed by modifying NL questions in the Spider dataset. Specifically, we replace the schema annotations in the NL question with synonyms, manually selected so as not to change the corresponding SQL query, as shown in Figure 1. We demonstrate that when models are trained only on the original Spider dataset, they suffer a significant performance drop on Spider-Syn, even though the Spider-Syn benchmark is not constructed to exploit worst-case attacks against text-to-SQL models. It is therefore clear that the performance of these models will suffer in real-world use, particularly in cross-domain scenarios.
To improve the robustness of text-to-SQL models, we utilize synonyms of table schema words, which are either manually annotated or automatically generated when no annotation is available. We investigate two categories of approaches to incorporate these synonyms. The first category modifies the schema annotations of the model input so that they align better with the NL question; no additional training is required for these approaches. The second category is based on adversarial training, where we augment the training set with NL questions modified by synonym substitution. Both categories of approaches significantly improve robustness, and the first category is more effective and requires fewer computational resources. In short, we make the following contributions:
• We conduct a comprehensive study to evaluate the robustness of text-to-SQL models against synonym substitution.
• Besides worst-case adversarial attacks, we further introduce Spider-Syn, a human-curated dataset built upon Spider, to evaluate synonym substitution for real-world question paraphrases.
• We propose a simple yet effective approach to utilize multiple schema annotations, without the need for additional training. We show that our approach outperforms adversarial training methods on Spider-Syn, and achieves competitive performance under worst-case adversarial attacks.

Construction Principles
The goal of constructing the Spider-Syn dataset is not to perform worst-case adversarial attacks against existing text-to-SQL models, but to investigate model robustness to paraphrases of schema-related words, which is particularly important when users do not have knowledge of the table schemas. We carefully select the synonyms that replace the original text so that the new words will not cause ambiguity in certain domains. For example, the word 'country' can often be used to replace the word 'nationality'. However, we do not perform this replacement in a domain where 'country' means a person's country of birth, which is distinct from another schema item, 'nationality'. Conversely, some synonym substitutions are only valid in a specific domain. For example, the words 'number' and 'code' are not generally synonymous, but 'flight number' can be replaced by 'flight code' in the aviation domain.
Most synonym substitutions use relatively common words (see footnote 3) to replace the schema item words. Besides, we treat 'id', 'age', 'name', and 'year' as reserved words, which are the most standard words to represent their meanings. Under this principle, we keep some original Spider examples unchanged in Spider-Syn. Our synonym substitution does not guarantee that the modified NL question has exactly the same meaning as the original question, but it does guarantee that the corresponding SQL query is unchanged. In Figure 2, Spider-Syn replaces the cell value word 'dog' with 'puppy'. Although a puppy is only a subset of a dog, the corresponding SQL for the Spider-Syn question should still use the word 'dog' instead of the word 'puppy', because there is only a dog type in the database and no puppy type. Similar reasoning is needed to infer that the word 'female' corresponds to 'F' in Figure 2.
Footnote 3: According to the 20,000 most common English words in https://github.com/first20hours/google-10000-english.
In some cases, words are replaced by synonymous phrases (rather than single words), as shown in Figure 3. Besides, some substitutions are based on the database contents. For example, the column 'location' of the database 'employee hire evaluation' in Spider only stores city names as cell values. Without knowing the table schema, users are more likely to say 'city' instead of 'location' in their NL questions.
To summarize, we construct Spider-Syn with the following principles:
• Spider-Syn is not constructed to exploit worst-case adversarial attacks, but to represent real-world use scenarios; it therefore uses only relatively common words as substitutions.
• We conduct synonym substitution only for words related to schema items and cell values.
• Synonym substitution includes both single words and phrases with multiple words.
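The principles above can be sketched as a single domain-scoped substitution step. The dictionaries below are hypothetical illustrations, not the actual Spider-Syn annotations:

```python
# Toy illustration of domain-scoped synonym substitution under the
# construction principles: substitutions are only valid within their
# domain, and reserved words are never replaced. The dictionaries are
# hypothetical examples, not the actual Spider-Syn annotation data.

DOMAIN_SYNONYMS = {
    "aviation": {"number": "code"},  # 'number' -> 'code' only makes sense here
    "pets": {"dog": "puppy"},        # the SQL query still uses 'dog' (see Figure 2)
}
RESERVED_WORDS = {"id", "age", "name", "year"}

def substitute(question_tokens, domain):
    """Replace schema-related tokens with their domain-specific synonyms,
    leaving reserved words untouched."""
    synonyms = DOMAIN_SYNONYMS.get(domain, {})
    return [t if t in RESERVED_WORDS else synonyms.get(t, t)
            for t in question_tokens]
```

Note that the same token ('number', 'dog') is mapped differently depending on the domain, which mirrors why the annotation is organized per domain before synonym selection.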

Annotation Steps
Before annotation, we first separate the original Spider samples based on their domains. For each domain, we only utilize synonyms that are suitable for that domain. We recruit four graduate students majoring in computer science to annotate the dataset manually. They are trained with a detailed annotation guideline, the principles above, and some samples. An annotator may start only after their trial samples are approved by the whole team. As synonyms can be freely chosen by annotators, standard inter-annotator agreement metrics are not sufficient to confirm the data quality. Instead, we conduct quality control with two rounds of review. The first round is a cross-review between annotators: we require the annotators to discuss their disagreements and reach a final result by consensus. To improve efficiency, we extract all synonym substitutions from the annotated data into a report without the NL questions, as shown in Figure 4, so that the annotators do not have to go through the NL questions one by one. The second round of review is similar to the first but is done by native English speakers.

Dataset Statistics
In Spider-Syn, 5672 questions are modified compared to the original Spider dataset. In 5634 cases schema item words are modified, while cell value words are modified in only 27 cases. We use 273 synonymous words and 189 synonymous phrases to replace approximately 492 different words or phrases in these questions. Across all Spider-Syn examples, there is an average of 0.997 changes per question and 7.7 words or phrases modified per domain.
Besides, Spider-Syn keeps 2201 and 161 original Spider questions unchanged in the training and development sets, respectively. Between the training and development sets, 52 modified words or phrases overlap, accounting for 35% of the modifications in the development set.

Defense Approaches
We present two categories of approaches for improving model robustness to synonym substitution. We first introduce our multi-annotation selection approach, which can utilize multiple annotations for one schema item. We then present an adversarial training method based on analysis of the NL question and domain information.

Multi-Annotation Selection (MAS)
The synonym substitution problem emerges when users do not use the exact names in table schemas to query the database. Therefore, one defense against synonym substitution is to utilize multiple annotation words to represent the table schema, so that the schema linking mechanism remains effective. For example, for a database table named 'country', we annotate additional table names with similar meanings, e.g., 'nation', 'state', etc. In this way, we explicitly inform the text-to-SQL model that all these words refer to the same table, and thus the table should be referenced in the SQL query when the NL question includes any of the annotated words.
We design a simple yet effective mechanism to incorporate multiple annotation words, called multiple-annotation selection (MAS). For each schema item, we check whether any annotations appear in the NL question, and we select such annotations as the model input. When no annotation appears in the question, we select the default schema annotation, i.e., the same as the original Spider dataset. In this way, we could utilize multiple schema annotations simultaneously, without changing the model input format.
The main advantage of this method is that it does not require additional training, and could apply to existing models trained without synonym substitution questions. Annotating multiple schema words could be done automatically or manually, and we compare them in Section 4.
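The MAS selection step can be sketched as follows; the function and variable names are our own illustrations, not the released implementation:

```python
# A minimal sketch of Multi-Annotation Selection (MAS). For each schema
# item, pick the annotated synonym that appears in the NL question so
# that lexical schema linking still fires; otherwise keep the default
# (original Spider) annotation. Names here are illustrative only.

def select_annotation(question_tokens, default_annotation, synonym_annotations):
    """Return the annotation to feed the model for one schema item."""
    tokens = {t.lower() for t in question_tokens}
    for annotation in synonym_annotations:
        # Multi-word annotations are matched word-by-word for simplicity.
        if all(word in tokens for word in annotation.lower().split()):
            return annotation
    return default_annotation
```

For a table 'country' additionally annotated with 'nation' and 'state', the question "How many people live in each nation?" selects 'nation', restoring the lexical link without any retraining, while questions that mention none of the synonyms fall back to the original annotation.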

Adversarial Training
Motivated by the idea that adversarial training can improve the robustness of machine learning models against adversarial attacks (Madry et al., 2018; Morris et al., 2020), we implement adversarial training using the current open-source state-of-the-art model RAT-SQL (Wang et al., 2020). We use the BERT-Attack model (Li et al., 2020) to generate adversarial examples, and implement the entire training process based on the TextAttack framework (Morris et al., 2020). TextAttack provides 82 pre-trained models, including word-level LSTMs, word-level CNNs, BERT-Attack, and other pre-trained Transformer-based models.
We follow the standard adversarial training pipeline that iteratively generates adversarial examples, and trains the model on the dataset augmented with these adversarial examples. When generating adversarial examples for training, we aim to generate samples that align with the Spider-Syn principles, rather than arbitrary adversarial perturbations. We describe the details of adversarial example generation below.

Generating Adversarial Examples
The same word may require different synonyms in different contexts. For example, the word 'head' in 'the head of a department' and 'the head of a body' should correspond to different synonyms. Making such distinctions requires an analysis of the entire sentence, since the positions of the key words may not be close to each other; for instance, the words 'head' and 'department' are far apart in 'Give me the info of heads whose name is Mike in each department'.

BERT-Attack
In addition to the original question, we feed extra domain information into the BERT-Attack model, as shown in Figure 5. Without the domain information, as shown on the right side of Figure 5, the BERT-Attack model conjectures that the word 'head' represents the head of a body, since there are multiple feasible interpretations of 'head' when looking at the question alone. To eliminate this ambiguity, we feed each question together with its domain information into the BERT-Attack model, as shown on the left side of Figure 5.
Instead of using schema annotations, we select several other questions from the same domain as domain information. These questions should contain the schema item words we plan to replace, as well as other distinct schema item words from the same domain. The benefits of using sentences instead of schema annotations as domain information are: 1) we avoid many unrelated schema annotations, which could include hundreds of words; and 2) the sentence format is closer to the pre-training data of BERT. As shown on the left side of Figure 5, our method improves the quality of the generated data.
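This input construction can be sketched as simple concatenation; the actual attack runs through BERT-Attack in the TextAttack framework, which we do not reproduce here:

```python
# Prepend a few same-domain questions, in plain sentence form, to the
# question being attacked, so that a contextual model can disambiguate
# schema words such as 'head'. Purely illustrative; the real pipeline
# passes this string to the BERT-Attack model.

def build_attack_input(question, domain_questions, max_context=2):
    """Concatenate up to `max_context` same-domain questions before the
    target question."""
    context = " ".join(domain_questions[:max_context])
    return (context + " " + question).strip()
```

Keeping the context in sentence form (rather than dumping raw schema annotations) matches the two benefits listed above: the input stays short and resembles BERT's pre-training data.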
Since our work focuses on synonym substitution of schema item words, we impose two additional constraints on the generation of adversarial examples: 1) only words referring to schema items and cell values can be replaced; and 2) the reserved words discussed in Section 2.2 are never replaced. These constraints ensure that the adversarial examples only perform synonym substitution for words related to database tables.
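The two constraints can be expressed as a token filter applied before substitution; this is a sketch under our own naming, not the actual implementation:

```python
# Only tokens that refer to schema items or cell values may be replaced,
# and reserved words (Section 2.2) are never replaced. Illustrative only.

RESERVED_WORDS = {"id", "age", "name", "year"}

def is_replaceable(token, schema_item_words, cell_value_words):
    """Return True if `token` is a legal target for synonym substitution."""
    t = token.lower()
    if t in RESERVED_WORDS:
        return False
    return t in schema_item_words or t in cell_value_words
```

In a TextAttack-style pipeline, a predicate like this would be wrapped as a pre-transformation constraint so the attack never proposes perturbations outside the schema-related vocabulary.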

Experimental Setup
We compare our approaches against baseline methods on both the Spider (Yu et al., 2018b) and Spider-Syn development sets. As discussed in Section 2.1, the Spider test set is not publicly accessible, and thus Spider-Syn does not contain a test set. Both Spider and Spider-Syn contain 7000 training samples and 1034 development samples, with 146 databases for training and 20 for development. The SQL queries and schema annotations in Spider and Spider-Syn are the same; the difference is that the questions in Spider-Syn are modified from Spider by synonym substitution. Models are evaluated using the official exact match accuracy metric of Spider.
We first evaluate open-source models that reach competitive performance on Spider — GNN (Bogin et al., 2019a), IRNet (Guo et al., 2019), and RAT-SQL (Wang et al., 2020) — on the Spider-Syn development set. We then evaluate our approaches with the RAT-SQL+BERT model (denoted RAT-SQL B) on both the Spider-Syn and Spider development sets.
We examine the robustness of the following approaches to synonym substitution.
[Table 1, partially recovered: exact match accuracy on the Spider / Spider-Syn development sets. IRNet (Guo et al., 2019): 53.2% / 28.4%; RAT-SQL + SPR (Wang et al., 2020): 62.7% / 33.6%; RAT-SQL B + SPR (Wang et al., 2020): values not recovered.]
Although Spider-Syn is not constructed to exploit worst-case attacks on text-to-SQL models, the performance of all models clearly drops by about 20% to 30% on Spider-Syn compared to Spider. Models using BERT for input embeddings suffer less performance degradation than models without BERT, but the drop is still significant. These experiments demonstrate that training on Spider alone is insufficient for achieving good performance under synonym substitution, because the Spider dataset contains only a few questions with synonym substitution.
To obtain a better understanding of the prediction results, we compare the F1 scores of RAT-SQL B +SPR on different SQL components on both the Spider and Spider-Syn development sets. As shown in Table 2, the performance degradation mainly comes from components that include schema items, while the decline on the 'KEYWORDS' and 'AND/OR' components, which do not include schema items, is marginal. This observation is consistent with the design of Spider-Syn, which focuses on the substitution of schema item words. Table 3 presents the results of RAT-SQL B trained with different approaches. We focus on RAT-SQL B since it achieves the best performance on both Spider and Spider-Syn, as shown in Table 1. Our MAS approaches significantly improve the performance on Spider-Syn, with only a 1-2% performance degradation on Spider. With ManualMAS, we see an accuracy of 62.6%, which outperforms all other approaches evaluated on Spider-Syn.

Comparison of Different Approaches
We compare RAT-SQL B trained on Spider (SPR) as a baseline with the other approaches. RAT-SQL B trained on Spider-Syn (SPR SYN) obtains an 11.7% accuracy improvement when evaluated on Spider-Syn, while suffering only a 1.9% accuracy drop when evaluated on Spider. Meanwhile, our adversarial training method based on BERT-Attack (ADV BERT) improves the accuracy by 10.3% on Spider-Syn. We observe that ADV BERT performs much better than adversarial training based on GLOVE (ADV GLOVE); we provide an explanation in Section 4.4. Both of our multiple annotation methods (ManualMAS and AutoMAS) improve on the baseline model when evaluated on Spider-Syn. The performance of ManualMAS is better because the synonyms in ManualMAS are exactly the same as the synonym substitutions in Spider-Syn. We discuss further results on multi-annotation selection in Section 4.5.
Table 4: Exact match accuracy on the worst-case development sets generated by ADV GLOVE and ADV BERT. All approaches use the RAT-SQL B model.

Evaluation on Adversarial Attacks
Observing the dramatic performance drop on Spider-Syn, we next study model robustness under worst-case attacks. We use the adversarial example generation modules of ADV GLOVE and ADV BERT to attack RAT-SQL B +SPR, generating two worst-case development sets. Table 4 presents the results on these two sets. The ADV GLOVE and ADV BERT attacks cause the accuracy of RAT-SQL B +SPR to drop by 31.7% and 20.9%, respectively. RAT-SQL B +SPR+AutoMAS achieves the best performance in defending against the ADV GLOVE attack, because the annotations in AutoMAS cover the synonym substitutions generated by ADV GLOVE. The relation between AutoMAS and ADV GLOVE is similar to that between ManualMAS and Spider-Syn; analogously, ManualMAS helps RAT-SQL B +SPR achieve the best accuracy on Spider-Syn, as shown in Table 3.
As for the ADV BERT attack, RAT-SQL B +ADV BERT outperforms the other approaches. This result is not surprising, because RAT-SQL B +ADV BERT is trained to defend against the ADV BERT attack. However, why does RAT-SQL B +ADV GLOVE perform so poorly in defending against the ADV GLOVE attack?
We conjecture that this is because word embeddings from BERT are contextual: if a word is replaced with a so-called synonym that does not fit the context, BERT may assign this synonym a vector with low similarity to the original. In the first example of Table 6, ADV GLOVE replaces the word 'courses' with 'trajectory'. We observe that, based on the cosine similarity of BERT embeddings, the schema item most similar to 'trajectory' changes from 'courses' to 'grade conversion'. This problem does not appear in the Spider-Syn and ADV BERT examples, and some ADV GLOVE examples do not have this problem either, such as the second example in Table 6. As a result, some examples reward the model for finding the schema item most similar to the question token, while others penalize this pattern, which causes the model to fail to learn. Thus the model trained with ADV GLOVE neither defends against the ADV GLOVE attack nor obtains good performance on Spider.
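The nearest-schema-item comparison behind this conjecture can be illustrated with plain cosine similarity; the vectors below are toy placeholders, not real BERT embeddings:

```python
import math

# Illustrates how a context-mismatched substitute can flip which schema
# item is nearest in embedding space. Vectors are toy placeholders
# standing in for contextual BERT embeddings.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_schema_item(token_vec, schema_vecs):
    """Return the schema item whose embedding is most similar to the token."""
    return max(schema_vecs, key=lambda name: cosine(token_vec, schema_vecs[name]))
```

If the embedding of the substituted token drifts away from the original schema item's embedding (as with 'trajectory' vs. 'courses'), the argmax over schema items changes, and the training signal for schema linking becomes inconsistent across examples.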

Ablation Study
To analyze the individual contribution of our proposed techniques, we run additional experiments and show the results in Table 5. Specifically, we use RAT-SQL B +SPR, RAT-SQL B +SPR SYN, RAT-SQL B +SPR SPR&SYN, and RAT-SQL B +ADV BERT as base models; we then apply different schema annotation methods to these models and evaluate their performance on different development sets. Note that all base models use the original Spider schema annotations.
First, for all base models, we find that MAS consistently improves model performance when questions are modified by synonym substitution. Specifically, when evaluating on Spider-Syn, ManualMAS achieves the best performance, because ManualMAS contains the synonym substitutions of Spider-Syn. Meanwhile, when evaluating under worst-case adversarial attacks, AutoMAS mostly outperforms ManualMAS. Considering that AutoMAS is automatically generated, it is a simple and efficient way to improve the robustness of text-to-SQL models.

Further Discussion on MAS
ManualMAS utilizes the same synonym annotations as Spider-Syn, mirroring the relationship between AutoMAS and ADV GLOVE; we designed this mechanism to demonstrate the effectiveness of MAS in an ideal case. The superior performance of ManualMAS on Spider-Syn confirms that the failure of existing models on Spider-Syn is largely because they rely on lexical correspondence, and that MAS improves performance by repairing the lexical link. Besides, MAS has the following advantages:
• Compared to adversarial training, MAS does not need any additional training. Therefore, by including different annotations for MAS, the same pre-trained model can be applied to application scenarios with different requirements for robustness to synonym substitution.
• MAS can also be combined with existing defenses, e.g., applied on top of adversarially trained models, as shown in our evaluation.
We additionally evaluate the combination of MAS with GNN and IRNet, shown in Table 7. The conclusions are similar to those for RAT-SQL: (1) MAS significantly improves the performance on Spider-Syn, and ManualMAS achieves the best performance; (2) AutoMAS also considerably improves the performance under adversarial attacks.

Related Work
Text-to-SQL translation. Text-to-SQL translation has been a long-standing challenge, and various benchmarks have been constructed for this task (Iyer et al., 2017; Popescu et al., 2003; Tang and Mooney, 2000; Giordani and Moschitti, 2012; Li and Jagadish, 2014; Yaghmazadeh et al., 2017; Zhong et al., 2017; Yu et al., 2018b). In particular, most recent works aim to improve performance on the Spider benchmark (Yu et al., 2018b), where models are required to synthesize SQL queries with complex structures, e.g., JOIN clauses and nested queries, and need to generalize across databases of different domains. Among various model architectures (Yu et al., 2018a; Bogin et al., 2019a; Guo et al., 2019; Zhang et al., 2019b; Bogin et al., 2019b; Wang et al., 2020), the latest state-of-the-art models implement a schema linking method based on exact lexical matching between the NL question and the table schema items (Guo et al., 2019; Bogin et al., 2019a; Wang et al., 2020). Schema linking is essential for these models, and removing it causes a huge performance drop. Based on this observation, we investigate the robustness of such models to synonym substitution in this work.
Data augmentation for text-to-SQL models.
Existing works have proposed data augmentation and adversarial training techniques to improve the performance of text-to-SQL models. Xiong and Sun (2019) propose an AugmentGAN model to generate samples in the target domain for data augmentation, so as to improve cross-domain generalization. However, this approach only supports SQL queries executed on a single table, as in WikiSQL. Li et al. (2019) propose data augmentation specialized for learning the spatial information in databases, which improves performance on the single-domain GeoQuery and Restaurants datasets. Some recent works study data augmentation to improve model performance on variants of existing SQL benchmarks. Specifically, Radhakrishnan et al. (2020) focus on search-style questions that are short and colloquial, and Zhu et al. (2020) study adversarial training to improve adversarial robustness; however, both are based on WikiSQL. Zeng et al. (2020) study model robustness when NL questions are untranslatable or ambiguous; they construct a dataset of such questions based on the Spider benchmark, and perform data augmentation to detect confusing spans in the question. In contrast, our work investigates robustness against synonym substitution for cross-domain text-to-SQL translation, supporting complex SQL structures.
Synonym substitution for other NLP problems.
The study of synonym substitution can be traced back to the 1970s (Waltz, 1978; Lehmann and Stachowitz, 1972). With the rise of machine learning, synonym substitution has been widely used in NLP for data augmentation and adversarial attacks (Rizos et al., 2019; Wei and Zou, 2019; Ebrahimi et al., 2018; Alshemali and Kalita, 2020; Ren et al., 2019). Many adversarial attacks based on synonym substitution have successfully compromised the performance of existing models (Alzantot et al., 2018; Zhang et al., 2019a; Ren et al., 2019; Jin et al., 2020). Recently, Morris et al. (2020) integrated many of the above works into their TextAttack framework for ease of use.

Conclusion
We introduce Spider-Syn, a human-curated dataset based on the Spider benchmark for evaluating the robustness of text-to-SQL models to synonym substitution. We find that the performance of previous text-to-SQL models drops dramatically on Spider-Syn, as well as under other adversarial attacks that perform synonym substitution. We design two categories of approaches to improve model robustness, i.e., multi-annotation selection and adversarial training, and demonstrate the effectiveness of our approaches.