On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex

Semantic parsing is a technique aimed at constructing a structured representation of the meaning of a natural-language question. Recent advances in language models trained on code have shown superior performance in generating these representations compared to language models trained solely on natural language text. However, existing fine-tuned neural semantic parsers are vulnerable to adversarial attacks on natural-language inputs. While it has been established that the robustness of smaller semantic parsers can be enhanced through adversarial training, this approach is not feasible for large language models in real-world scenarios, as it requires both substantial computational resources and expensive human annotation on in-domain semantic parsing data. This paper presents the first empirical study on the adversarial robustness of a prompt-based semantic parser based on CODEX, a state-of-the-art (SOTA) language model trained on code. Our results demonstrate that the large language model of code is vulnerable to carefully crafted adversarial examples. To overcome this challenge, we propose methods for enhancing robustness without requiring substantial amounts of labelled data or intensive computational resources.


Introduction
Semantic parsing is a technique that transforms natural-language utterances (NLs) into machine-readable logical forms (LFs) and has been widely applied in various research fields, such as code generation, question-answering systems, and dialogue systems (Kamath and Das, 2018). Most current state-of-the-art semantic parsers are deep-learning models trained in a supervised manner using in-domain data. However, this approach requires a large amount of in-domain semantic parsing data, which can be costly to obtain (Bapna et al., 2017). To address this issue, prompt-based semantic parsers based on large pre-trained language models, such as Codex and GPT-J (Wang and Komatsuzaki, 2021), have become a new choice for semantic parsing applications. Prompt-based semantic parsers learn to solve a new task by in-context learning, instructing the parsers to generate correct LFs by constructing the prompt with a few demonstration examples. Such a method can significantly lower the cost of annotation by including only a few exemplars in the prompt and achieve comparable results to fully-supervised semantic parsers (Shin and Van Durme, 2022).
Recent studies (Huang et al., 2021; Pi et al., 2022) show that fully-supervised semantic parsers are vulnerable to adversarial attacks, which perturb input sentences into semantically equivalent adversaries to mislead models into producing attacker-desired outputs. Hence, to mitigate such attacks, various adversarial training methods (Tramer and Boneh, 2019; Shafahi et al., 2019; Ganin et al., 2016; Shafahi et al., 2020) have been proposed to improve the adversarial robustness of semantic parsers. In light of this, two main questions naturally arise: (1) Do prompt-based semantic parsers based on large pre-trained language models also suffer from adversarial attacks? (2) If so, how can we improve the robustness of the large prompt-based semantic parsers?
To address the former question, we evaluate the prompt-based semantic parsers on several evaluation sets built with the perturbation approaches used in the AdvGLUE dataset. Adopting the adversarial evaluation metrics proposed by Huang et al. (2021), we find that the prompt-based semantic parsers are vulnerable to various types of adversarial attacks.
Based on the experimental results from the first step, we perform a three-fold experiment to answer the latter question. The first part of the study aims to determine whether the inclusion of additional examples within the prompt during in-context learning improves the robustness of prompt-based parsers. This hypothesis is based on prior research demonstrating that increasing the size of the training data enhances the robustness of fully-supervised models (Pang et al., 2019). The second part of the study aims to determine whether the integration of few-shot adversarial examples within prompts can improve the robustness of Codex. This is based on the observation that conventional adversarial training methods often include adversarial examples within the training set (Miyato et al., 2016; Tramer and Boneh, 2019). Finally, the third part of the study aims to evaluate whether sampling methods other than random sampling can select more effective examples that improve the robustness of prompt-based parsers.
In this work, we perform a series of experiments to probe CODEX, a large pre-trained model trained on code, on two semantic parsing benchmarks, GeoQuery (Zelle and Mooney, 1996) and Scholar (Iyer et al., 2017). Our key findings from the above experiments are as follows:
• Prompt-based semantic parsers are vulnerable to adversarial examples, particularly the ones crafted by sentence-level perturbations.
• In-context learning with more demonstration examples in the prompt can improve the indomain robustness of prompt-based parsers.
• Augmenting the prompt with adversarial examples has limited effect in improving the robustness of prompt-based parsers.
• Few-shot example sampling strategies that select examples with higher linguistic diversity can result in stronger robustness for the prompt-based parsers.

Related Work
Prompt-based Learning. Prompt-based learning is an alternative to supervised learning that aims to reduce the reliance on large human-annotated datasets. Unlike traditional supervised models, which estimate the probability of an output given an input text, prompt-based learning models estimate the probability of the text directly. This is achieved by applying prompt functions that modify the input text into prompt templates with unfilled slots. By filling these slots, various Natural Language Processing (NLP) tasks can be completed, such as common-sense reasoning (Kojima et al., 2022), self-rationalization (Marasović et al., 2021), and text style transfer (Suzgun et al., 2022). The development of prompt-based methods has enabled zero-shot and few-shot learning in a variety of artificial intelligence domains (Ramesh et al., 2021; Yang et al., 2022; Sanghi et al., 2022). Recent research has also evaluated the capabilities of few-shot prompt-based learning for semantic parsing (Shin and Van Durme, 2022; Roy et al., 2022a; Drozdov et al., 2022).
Adversarial Attacks. A large body of work crafts adversarial examples against NLP models, including character-level perturbations (Hosseini et al., 2017; Ebrahimi et al., 2018; Belinkov and Bisk, 2018; Gao et al., 2018; Eger et al., 2019; Boucher et al., 2022), sentence-level rewriting (Iyyer et al., 2018; Ribeiro et al., 2018; Zhao et al., 2018), and adversarial word substitutions (Alzantot et al., 2018; Liang et al., 2018; Zhang et al., 2019).
There has been an increasing interest in defending against adversarial attacks in large language models via adversarial training (Yi et al., 2021;Bartolo et al., 2021;Guo et al., 2021). Adversarial training involves incorporating adversarial examples in the training set, thus making the model robust to such attacks. However, adversarial training can sometimes negatively impact the generalization ability of the neural models (Raghunathan et al., 2019;Min et al., 2021).

Robustness Evaluation for Prompt-based Semantic Parsing
This section gives an overview of our evaluation framework, including the methods of constructing the evaluation corpora and the evaluation metrics to evaluate the robustness of the prompt-based semantic parser.

Construction of the Evaluation Corpus
A robust prompt-based semantic parser should be able to parse both the utterances and their adversarial counterparts into correct LFs. As proposed by Huang et al. (2021), an adversary of an utterance for a semantic parser is defined as i) an utterance with the same semantic meaning as the original one according to human judgment and ii) an utterance on which the semantic parser cannot produce the correct LF. Therefore, to evaluate the robustness of prompt-based semantic parsers, we craft the robustness evaluation sets by perturbing the original utterances in existing benchmark datasets with multiple adversarial perturbation methods. Such perturbations should not alter the semantics of the original utterances. Each example in a robustness evaluation set is a perturbed utterance paired with its ground-truth LF. Next, we introduce the details of each perturbation method and how we guarantee that the perturbations do not change the semantics. Table 1 illustrates some meaning-preserving perturbed utterances in the robustness evaluation set of GeoQuery, produced by different perturbation methods. More examples can be found in Appendix B.

Adversarial Perturbations
Following the principles for designing adversarial attacks in AdvGLUE, we perform five word-level perturbations and two sentence-level perturbations to generate seven robustness evaluation sets from the standard evaluation set of each benchmark.
• Typo-based (TB) uses TextBugger  to replace two words in each utterance with the typos.
• Random Deletion (RD) randomly deletes two words in the utterance.
• Random Swap (RS) swaps the positions of two random words in each utterance.
• Context-aware Substitution (CS) leverages RoBERTa (Liu et al., 2019) to substitute two random words with their synonyms (a code sketch of this perturbation follows the list).
• Context-aware Insertion (CI) inserts two most probable words selected by RoBERTa at two random positions in each utterance.
• Distraction-based (DB) appends interrogation statements to the end of each NL, inspired by StressTest (Naik et al., 2018). Specifically, we design the following interrogation statements: "who is who; what is what; when is when; which is which; where is where", in which the selected interrogative words are the ones more likely to appear in the utterances.
• Rewriting-based (RB) paraphrases each NL at the sentence level with a rewriting model trained on human paraphrase pairs.
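To make the word-level perturbations concrete, the following is a minimal sketch of the Context-aware Substitution (CS) method using the Hugging Face transformers fill-mask pipeline with RoBERTa; the word-selection heuristic and the model checkpoint are illustrative assumptions rather than the exact implementation used in our experiments.

```python
# A sketch of Context-aware Substitution (CS): mask two random words and
# replace each with RoBERTa's most probable in-context prediction.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")  # checkpoint is illustrative

def context_aware_substitution(utterance: str, n_words: int = 2, seed: int = 0) -> str:
    random.seed(seed)
    tokens = utterance.split()
    positions = random.sample(range(len(tokens)), k=min(n_words, len(tokens)))
    for pos in positions:
        masked = tokens[:pos] + [fill_mask.tokenizer.mask_token] + tokens[pos + 1:]
        candidates = fill_mask(" ".join(masked))
        # Skip candidates identical to the original word (case-insensitive).
        for cand in candidates:
            word = cand["token_str"].strip()
            if word.lower() != tokens[pos].lower():
                tokens[pos] = word
                break
    return " ".join(tokens)

print(context_aware_substitution("what is the capital of the state with the largest population"))
```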

Data Filtering
In order to ensure that the perturbed examples preserve the meaning of the original NL, we design a two-stage evaluation process: Step 1: We first generate 20 adversarial examples against the original NL for each perturbation method and choose the top 10 candidates ranked by the text similarity scores between the original and the perturbed utterances, which are calculated by Sentence-BERT (Reimers and Gurevych, 2019).
Step 2: We engage human experts to select the best one among the 10 adversarial candidates produced in Step 1.
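Below is a minimal sketch of Step 1, assuming the sentence-transformers package; the specific Sentence-BERT checkpoint is an illustrative assumption, and candidate generation is stubbed out.

```python
# Step 1 of data filtering: rank perturbed candidates by similarity to the original NL.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any Sentence-BERT model

def rank_candidates(original: str, candidates: list[str], top_k: int = 10) -> list[str]:
    """Keep the top-k perturbed candidates most similar to the original utterance."""
    embeddings = encoder.encode([original] + candidates, convert_to_tensor=True)
    scores = util.cos_sim(embeddings[0:1], embeddings[1:])[0]   # shape: (num_candidates,)
    ranked = sorted(zip(candidates, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [cand for cand, _ in ranked[:top_k]]

# Step 2 (human selection of the single best candidate) happens outside this script.
```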

Evaluation Metrics
Since the output LFs of the prompt-based language models may not follow the same naming convention (Shin et al., 2021;Shin and Van Durme, 2022) as the ground truth, previous string-based evaluation metrics, including BLEU (Papineni et al., 2002) and Exact Match (Poon and Domingos, 2009), are not suitable for prompt-based semantic parsers. Therefore, we follow Rajkumar et al. (2022) to report the execution accuracy, which is based solely on the execution correctness of the LFs on the test sets, for the purpose of robustness evaluation.
Following Huang et al. (2021), we report the experiment results with three variants of execution accuracy, namely standard accuracy, perturbation accuracy, and robust accuracy:
• Standard Accuracy is measured on the standard (original) test sets.
• Perturbation Accuracy tests the performance of the model on perturbed test sets.
• Robust Accuracy is defined as $n/|R_{\mathrm{eval}}|$, where $R_{\mathrm{eval}}$ denotes a subset of the perturbed test sets and $n$ is the number of utterances in $R_{\mathrm{eval}}$ that are parsed correctly. More specifically, $R_{\mathrm{eval}}$ consists of the examples whose counterparts before perturbation are parsed correctly. Intuitively, Robust Accuracy estimates the proportion of cases that a parser parses correctly before perturbation and still parses correctly after perturbation, and hence reflects the robustness of the parser against adversarial perturbation.
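Below is a minimal sketch of how the three accuracy variants can be computed, assuming each evaluation record stores execution correctness on the original and perturbed utterances (the field names are illustrative).

```python
# Sketch of the three execution-accuracy variants over a list of evaluation records.
def standard_accuracy(results):
    return sum(r["orig_exec_correct"] for r in results) / len(results)

def perturbation_accuracy(results):
    return sum(r["pert_exec_correct"] for r in results) / len(results)

def robust_accuracy(results):
    # R_eval: examples whose original (unperturbed) utterance was parsed correctly.
    r_eval = [r for r in results if r["orig_exec_correct"]]
    if not r_eval:
        return 0.0
    return sum(r["pert_exec_correct"] for r in r_eval) / len(r_eval)
```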

Improving Robustness of Prompt-based Semantic Parsers
Instead of predicting the LF conditioned only on the input utterance, large language models such as CODEX can learn to solve a specific task by in-context learning. During in-context learning, the parser predicts the LF conditioned on a prompt which consists of a small list of utterance-LF pairs that demonstrate the semantic parsing task and, optionally, a table schema and a task instruction. Given this difference, it is unclear whether in-context learning improves the robustness of the parser in the same way as conventional supervised training. In this paper, we conduct the first investigation of in-context learning for model robustness. More specifically, we examine the impact of variants of in-context learning and sampling methods on parser robustness.

Standard In-context Few-shot Learning
In our setting, given an input utterance $x$, the pre-trained language model $P(\cdot;\theta)$ predicts the LF $y'$ conditioned on the prompt, which consists of a set of demonstration examples $\mathcal{M} = \{(x_i, y_i)\}_{i=1}^{N}$ and a table schema $T$:
$$y' = \arg\max_{y} P\big(y \mid \mathcal{M}, T, x; \theta\big).$$
For the few-shot setting, the number of demonstration examples $N$ is limited by a budget size.

Adversarial In-context Few-shot Learning
In adversarial in-context learning, we additionally include perturbed adversarial examples $\mathcal{M}_{\mathrm{adv}}$ among the demonstration examples, so the LF is predicted conditioned on the augmented example set:
$$y' = \arg\max_{y} P\big(y \mid \mathcal{M} \cup \mathcal{M}_{\mathrm{adv}}, T, x; \theta\big).$$

In-context Few-shot Selection
Current in-context learning assumes an example pool from which prompting examples can be selected. However, most prior work simply picks examples from the pool at random. We argue that the example selection strategy may strongly affect the robustness of the prompt-based semantic parser. Therefore, we examine various strategies for selecting in-context few-shot examples.
Random Sampling (Random). We randomly sample N utterances from the example pool.
Confidence-based Sampling (Confidence) (Duong et al., 2018). We score each utterance with the confidence of the parser on the predicted LF given the utterance and the table schema, and then select the utterances with the lowest parser confidence scores.
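A minimal sketch of confidence-based sampling is shown below; parser_confidence is a hypothetical helper (e.g., the mean token log-probability of the predicted LF) rather than an API provided by CODEX.

```python
# Sketch of confidence-based sampling: keep the N least-confident utterances.
def confidence_sampling(pool, schema, n_shots, parser_confidence):
    """pool: list of (utterance, LF) pairs; parser_confidence is a hypothetical scorer."""
    scored = [(parser_confidence(utt, schema), utt, lf) for utt, lf in pool]
    scored.sort(key=lambda t: t[0])            # ascending: lowest confidence first
    return [(utt, lf) for _, utt, lf in scored[:n_shots]]
```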
Diversity-based Sampling. Following prior work, we partition the utterances in the example pool into N clusters with the K-means (Wu, 2012) clustering algorithm and select the example closest to each cluster center. As the distance measure for K-means, we use the edit distance between utterances (Cluster-ED) (Wagner and Fischer, 1974), or the Euclidean distance between utterance features, either TF-IDF vectors (Cluster-TF-IDF) (Anand and Jeffrey David, 2011) or contextual word embeddings encoded by Sentence-BERT (Cluster-CWE) (Reimers and Gurevych, 2019).
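Below is a minimal sketch of the Cluster-TF-IDF variant using scikit-learn; Cluster-ED and Cluster-CWE differ only in the features and distance used.

```python
# Sketch of diversity-based sampling: cluster the pool and pick one example per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_tfidf_sampling(pool, n_shots, seed=0):
    """pool: list of (utterance, LF) pairs; returns n_shots diverse examples."""
    utterances = [utt for utt, _ in pool]
    features = TfidfVectorizer().fit_transform(utterances).toarray()
    kmeans = KMeans(n_clusters=n_shots, random_state=seed, n_init=10).fit(features)
    selected = []
    for center in kmeans.cluster_centers_:
        # Pick the utterance closest (Euclidean distance) to each cluster center.
        idx = int(np.argmin(np.linalg.norm(features - center, axis=1)))
        selected.append(pool[idx])
    return selected
```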
Perplexity-based Sampling (Sen and Yilmaz, 2020). We score each utterance with the perplexity assigned to it by GPT-2, and then select the utterances with the highest (PPL. Asc) and lowest (PPL. Desc) perplexity scores, respectively.
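A minimal sketch of the perplexity scoring is given below, assuming the Hugging Face transformers implementation of GPT-2; the specific GPT-2 size is an illustrative assumption.

```python
# Sketch of perplexity-based sampling with GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gpt2_perplexity(utterance: str) -> float:
    ids = tokenizer(utterance, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean token cross-entropy
    return float(torch.exp(loss))

def perplexity_sampling(pool, n_shots, reverse=True):
    # reverse=True keeps the highest-perplexity utterances; set False for the lowest.
    scored = sorted(pool, key=lambda ex: gpt2_perplexity(ex[0]), reverse=reverse)
    return scored[:n_shots]
```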
Prompt-based Semantic Parser. We choose CODEX as the representative prompt-based semantic parser for our evaluation. In recent studies, CODEX has performed comparably, via in-context few-shot semantic parsing, to SOTA supervised neural semantic parsers (Shin and Van Durme, 2022; Roy et al., 2022b; Drozdov et al., 2022) in terms of execution accuracy.
To examine the vulnerability of large prompt-based semantic parsers against adversarial examples, we choose the code-davinci-002 version of CODEX as it is the most powerful variant among all CODEX models, with 175B parameters. In our experiments, we sample a maximum of 200 tokens from CODEX with the temperature set to 0 and use a stop token to halt generation.
Prompts. In this work, we adopt the prompt design of Create Table + Select X as presented in Rajkumar et al. (2022), which has been shown to be effective for semantic parsing using static prompting.
The prompt for semantic parsing on CODEX consists of CREATE TABLE commands, including specifications of each table's columns and foreign key declarations, together with the column headers and the results of executing a SELECT * FROM T LIMIT X query on each table. As described in Section 4.3, we select NL-LF pairs as in-context few-shot examples from the train sets.
To guide the prompt-based semantic parser, we also include the textual instruction of "Using valid SQLite, answer the following questions for the tables provided above." as proposed by Rajkumar et al. (2022).
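To illustrate, the following sketch assembles a Create Table + Select X style prompt and queries code-davinci-002 through the legacy openai-python (<1.0) Completions API; the exact serialization of the schema, demonstrations, instruction comment, and stop tokens are illustrative assumptions rather than the precise format used in our experiments.

```python
# Sketch of prompt construction and parsing with code-davinci-002 (legacy openai API).
import openai

INSTRUCTION = ("-- Using valid SQLite, answer the following questions "
               "for the tables provided above.")

def build_prompt(schema_sql: str, demonstrations, utterance: str) -> str:
    """schema_sql: CREATE TABLE statements plus SELECT * FROM T LIMIT X results."""
    parts = [schema_sql, INSTRUCTION]
    for demo_nl, demo_sql in demonstrations:            # in-context few-shot examples
        parts.append(f"-- {demo_nl}\n{demo_sql}")
    parts.append(f"-- {utterance}\nSELECT")             # let the model complete the SQL
    return "\n\n".join(parts)

def parse(schema_sql, demonstrations, utterance):
    response = openai.Completion.create(
        engine="code-davinci-002",
        prompt=build_prompt(schema_sql, demonstrations, utterance),
        temperature=0,              # greedy decoding, as in our setup
        max_tokens=200,
        stop=["--", ";"],           # stop tokens are an assumption for this sketch
    )
    return "SELECT" + response["choices"][0]["text"]
```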

Research Questions and Discussions
Our experimental results answer the following four research questions (RQs) related to the robustness of CODEX.

RQ1: How vulnerable is the prompt-based semantic parser to adversarial examples?
Settings. To answer RQ1, we evaluate the standard accuracy and perturbation accuracy of CODEX on GeoQuery and Scholar test sets through zero-shot learning.
Results. The zero-shot parsing performances of CODEX are shown in Table 2.
Table 2: Results of perturbation accuracy (Pert. Acc.) and standard accuracy (Std. Acc.) of zero-shot performance on GeoQuery and Scholar. The zero-shot prompt only contains the table information and the initial semantic parsing instruction. Perturbation accuracy is calculated for each perturbation method.
Our first observation is that CODEX is more vulnerable to sentence-level perturbations than to word-level perturbations, as indicated by the larger gaps between standard and perturbed accuracies on the sentence-level perturbed test sets. Prior work observed that neural language models are vulnerable to human-crafted adversarial examples that involve complex linguistic phenomena (e.g., coreference resolution, numerical reasoning, negation). We observe that the rewriting model trained on human paraphrase pairs also introduces such complex linguistic phenomena. With respect to the word-level perturbations, CODEX is most robust to typo-based perturbations, which is surprising, as prior work shows typo-based perturbation to be the most effective attack against large language models such as BERT (Devlin et al., 2019) on natural language understanding tasks. However, utterances with typos reduce the accuracy of CODEX by only 3%. Random Deletion is also less effective than the other word-level methods, consistent with the observations by Huang et al. (2021) on fully-supervised semantic parsers. This phenomenon can be attributed to the fact that Random Deletion primarily makes minor modifications to the standard NL utterances, as this method often removes non-functional words such as the articles "the" and "a". Although CODEX is pre-trained on a considerably large dataset, it does not show robustness on the in-domain tasks. We conjecture that the reason is that zero-shot CODEX has not yet learned any in-domain knowledge of GeoQuery or Scholar. Hence, in RQ2, we address whether in-domain examples improve the robustness of CODEX.

RQ2: Does standard in-context few-shot learning improve the robustness?
Settings. We select up to 50 and 10 examples from the GeoQuery and Scholar train sets, respectively, with the random sampling strategy, to construct prompts for the parsers. Then, for each few-shot learning experiment, we measure standard accuracy, perturbation accuracy, and robust accuracy on our various perturbed test sets.

Results. Tables 3 and 4 show the performance of standard in-context few-shot learning on the robustness evaluation sets perturbed by different methods. We observe that adding more standard examples to the prompt evidently improves the robust accuracy of CODEX, which demonstrates the effectiveness of standard in-context few-shot learning in improving the robustness of semantic parsing. Although CODEX performs slightly worse on the test sets perturbed by the typo-based method than on those perturbed by random deletion in GeoQuery, we argue that this is due to performance variance and does not necessarily indicate weaker robustness.
The performance gap between perturbation accuracy and standard accuracy widens as the number of in-context shots increases, while the robust accuracy grows slowly. This indicates that improving the generalization ability of the parser does not necessarily improve its robustness. The trade-off between standard and robust accuracies is a long-standing problem in adversarial training. Raghunathan et al. (2019) show that increasing the training sample size can eliminate such a trade-off. Our experiments demonstrate that in-context learning follows similar patterns to supervised adversarial training: both objectives can be improved with a limited number of examples compared to the zero-shot parser, although the extent of improvement varies.
Takeaways. With more standard in-context examples, few-shot CODEX can be guided to achieve better robustness and standard execution performance.
RQ3: Does adversarial in-context learning improve robustness?

Settings. We present the experimental results of CODEX on both GeoQuery and Scholar, using 10 and 5 in-context examples, respectively. To assess the robustness of CODEX under adversarial in-context learning, we first augment the standard few-shot examples with examples whose utterances have been perturbed by the various perturbation methods. Then, for each set of augmented examples, we report the robust accuracy of CODEX averaged over all robustness evaluation sets.
Results. The experimental results of the various perturbation strategies applied to the in-context few-shot examples are presented in Table 5. While supervised adversarial training has been widely regarded as an effective means of enhancing the robustness of machine learning models, the results indicate that on both GeoQuery and Scholar, the robust accuracies are only marginally improved through adversarial in-context learning. Previous studies (Raghunathan et al., 2019; Huang et al., 2021) have pointed out that supervised adversarial training can sometimes decrease standard accuracy even as it improves robust accuracy. However, the results of adversarial in-context learning diverge from this trend: on Scholar we observe a significant improvement in standard accuracy, from 23% to more than 33%, while robust accuracy improves only marginally. These observations indicate that adversarial in-context learning behaves differently from supervised adversarial training in terms of enhancing model robustness. Furthermore, the results suggest that simply incorporating adversarial examples into the prompt has a limited impact on the robustness of parsers, in contrast to supervised adversarial training. Of all the perturbation strategies analyzed, CODEX achieves the best standard and robust accuracy with RB adversarial in-context learning, and the worst with TB adversarial in-context learning. Our hypothesis is that RB produces utterances with richer linguistic variation than the other perturbation methods, making the demonstrations more informative.

Takeaways. The robustness of few-shot CODEX can be marginally improved by adversarial in-context learning without significant negative impacts on standard performance.

RQ4: What is the impact of sampling techniques on the robustness of parsers that utilize standard in-context few-shot learning?
Settings. In order to compare the influence of different sampling methods on the robustness of the model, we vary the standard in-context examples on GeoQuery and Scholar with all 7 strategies described in Section 4.3. We choose the 50-shot setting for GeoQuery and the 10-shot setting for Scholar.
Results and Takeaways. Table 6 reports the accuracies when varying the sampling methods for few-shot example selection. We first observe that different sampling strategies impact both the robust and standard performance of CODEX. Overall, the Cluster methods, which diversify the features of the selected utterances, perform better than the other sampling methods. On the other hand, the PPL. Desc sampling method performs consistently worse than the other sampling methods. In brief, we conclude that CODEX is sensitive to the few-shot example sampling strategies.
RQ4-Ablation: Why does CODEX react differently to various sampling strategies?
Settings. The findings of RQ1 and RQ3 indicate that linguistic complexity has an impact on the performance of CODEX. As a result, the results of RQ4 may also be influenced by linguistic complexity. To further explore this correlation, three lexical diversity metrics, Type-Token Ratio (TTR) (Templin, 1957), Yule's I (Yule, 1944), and Measure of Textual Lexical Diversity (MTLD) (McCarthy, 2005), are adopted to measure the lexical diversity of the selected NLs from GeoQuery and Scholar. The TTR is defined as the ratio of the number of unique tokens, also known as types, to the total number of tokens; a higher TTR indicates higher lexical diversity. Yule's Characteristic Constant (Yule's K) is a measure of text consistency that considers vocabulary repetition; Yule's K and its inverse, Yule's I, are considered more robust to variations in text length than the TTR. MTLD is computed as the average number of words in a text required to maintain a specified TTR value.
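Below is a minimal sketch of these diversity measures; the MTLD implementation is a simplified forward-only pass (the original measure averages forward and backward passes), and the 0.72 threshold follows the common convention.

```python
# Sketch of the lexical diversity metrics: TTR, Yule's I, and a simplified MTLD.
from collections import Counter

def ttr(tokens):
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def yules_i(tokens):
    # Yule's I = 1e4 / Yule's K = N^2 / (sum_i i^2 * V_i - N), where N is the total
    # number of tokens and V_i is the number of types occurring exactly i times.
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(i * i * v for i, v in freq_of_freqs.items())
    return (n * n) / (s2 - n) if s2 != n else float("inf")

def mtld_forward(tokens, threshold=0.72):
    # Count a factor each time the running TTR of the current segment falls to the threshold.
    factors, start = 0.0, 0
    for i in range(1, len(tokens) + 1):
        if ttr(tokens[start:i]) <= threshold:
            factors += 1
            start = i
    if start < len(tokens):                      # partial factor at the end
        factors += (1 - ttr(tokens[start:])) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```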
Results and Takeaways. Table 7 presents the lexical diversity of each set of NLs sampled by different approaches. The diversity scores obtained from the three metrics align with the performance of CODEX as presented in Table 6. For instance, the three metrics indicate that the examples selected using the Cluster-TF-IDF method achieve higher lexical diversity compared to those selected through the other methods. Additionally, the Cluster-TF-IDF method also produces the highest results in terms of robust and standard accuracy for CODEX. Thus, it can be inferred that an increase in the lexical diversity of the few-shot examples leads to an improvement in the robust and standard accuracy of CODEX.

Conclusion
This study examines the robustness of semantic parsers in the context of prompt-based few-shot learning. To achieve this objective, robustness evaluation sets were curated to evaluate the robustness of the prompt-based semantic parser, CODEX. The research aims to identify methods to improve the robustness of CODEX. The results of our comprehensive experiments demonstrate that even a prompt-based semantic parser based on a large pre-trained language model is susceptible to adversarial attacks. Our findings also indicate that various forms of in-context learning can improve the robustness of the prompt-based semantic parser. It is believed that this research will serve as a catalyst for future studies on the robustness of prompt-based semantic parsing based on large language models.

Limitations
In this study, we examine the robustness of the prompt-based semantic parser, CODEX, by focusing on the impact of prompt design on its execution performance. However, future research should investigate alternative adversarial training strategies for prompt-based semantic parsers in order to advance this field. In addition, our focus is limited to text-to-SQL tasks, and we encourage further investigation into the robustness of semantic parsers across different datasets and LFs. Despite these limitations, we emphasize the importance of exploring more effective prompt design in order to enhance the robustness of prompt-based semantic parsers, including CODEX, which shows non-negligible vulnerability.

A Experiments
Datasets. The GeoQuery dataset contains 877 NL-LF pairs about U.S. geographical information, while Scholar contains NL-SQL pairs about academic publications. Finegan-Dollak et al. (2018) proposed a dataset split for evaluating the compositional generalization capability of semantic parsers on several datasets, including GeoQuery and Scholar.
The proposed split, referred to as the query-split, presents a more challenging scenario for semantic parsing models. This paper utilizes the query-split; the two test sets used in our evaluation of the prompt-based semantic parser contain 182 NL-LF pairs from GeoQuery and 315 NL-LF pairs from Scholar, respectively.
Model Versioning. The version of the code-davinci-002 model referred to in this paper is as of the midpoint of the year 2022.