On Robustness of Neural Semantic Parsers

Semantic parsing maps natural language (NL) utterances into logical forms (LFs), which underpins many advanced NLP problems. Semantic parsers gain performance boosts from deep neural networks, but also inherit their vulnerability to adversarial examples. In this paper, we provide the first empirical study on the robustness of semantic parsers in the presence of adversarial attacks. Formally, adversaries of semantic parsing are considered to be the perturbed utterance-LF pairs whose utterances have exactly the same meanings as the original ones. A scalable methodology is proposed to construct robustness test sets based on existing benchmark corpora. Our results answer five research questions by measuring the state-of-the-art parsers' performance on robustness test sets and evaluating the effect of data augmentation.


Introduction
Semantic parsing aims to map natural language (NL) utterances into logical forms (LFs), which can be executed on a knowledge base (KB) to yield denotations (Kamath and Das, 2018). At the core of the state-of-the-art (SOTA) semantic parsers are deep learning models, which are widely known to be vulnerable to adversarial examples. Such examples are created by adding tiny perturbations to inputs, yet can severely deteriorate model performance. To the best of our knowledge, despite the popularity of semantic parsing (Kamath and Das, 2018), there is still no published work studying the robustness of neural semantic parsers against adversarial examples. Therefore, we conduct the first empirical study to evaluate the effect of adversarial examples on SOTA neural semantic parsers.

Unlike in other disciplines, it is unclear what adversaries are for semantic parsers. For computer vision systems, adversaries are often generated by modifying inputs with imperceptible perturbations. In contrast, flipping a single word or character in an utterance can significantly change its meaning, so the changes are perceptible to humans. To address this issue, Michel et al. (2019) argue that adversaries for sequence-to-sequence models should maximally retain meanings after perturbing inputs. Moreover, any meaning-changing utterance is supposed to have a different meaning representation, whereas a robust semantic parser should be invariant to meaning-preserving modifications. In light of this, given a semantic parser, we define its adversaries as the perturbed utterances satisfying two conditions: i) they have exactly the same meanings as the original ones according to human judgements; and ii) the parser consistently produces incorrect LFs on them.
Although new evaluation frameworks have been proposed for NLP tasks (Xu et al., 2020; Michel et al., 2019), no framework is designed for assessing the robustness of semantic parsers against meaning-preserving adversaries. Current evaluation metrics focus only on standard accuracy, which measures to what degree predictions match gold standards. As prior work points out, it is challenging to achieve both high standard accuracy and high robust accuracy, which measures the accuracy on adversarially perturbed examples. To facilitate devising novel methods for robust semantic parsers, it is desirable to develop a semantic parsing evaluation framework that considers both measures.
In this work, we propose an evaluation framework for robust semantic parsing. The framework consists of an evaluation corpus and a set of customized metrics. We construct the evaluation corpus by extending three existing semantic parsing benchmark corpora. In order to generate meaning-preserving examples, we apply automatic methods to modify the utterances in those benchmark corpora by paraphrasing and injecting grammatical errors. Among the perturbed examples generated from the test sets, we build meaning-preserving test sets by filtering out the meaning-changing ones via crowdsourcing. The robustness of the semantic parsers is measured by a set of custom metrics, with and without adversarial training methods.
We conduct the first empirical study on the robustness of semantic parsing by evaluating three SOTA neural semantic parsers using the proposed framework. The key findings from our experiments are threefold:
• None of the SOTA semantic parsers consistently outperforms the others in terms of robustness against meaning-preserving adversarial examples;
• The neural semantic parsers are more robust to word-level perturbations than sentence-level ones;
• Adversarial training through data augmentation significantly improves robust accuracy but only slightly influences standard accuracy.
The generated corpus and source code are available at https://github.com/shuo956/On-Robustness-of-nerual-smentic-parsing.git

Related Work

Semantic Parsing
The SOTA neural semantic parsers formulate this task as a machine translation problem. They extend SEQ2SEQ with attention (Luong et al., 2015) to map NL utterances into LFs in target languages (e.g., lambda calculus, SQL, Python, etc.). One type of such parsers directly generates sequences of predicates as LFs (Dong and Lapata, 2016, 2018; Huang et al., 2018). The other type utilizes grammar rules to constrain the search space of LFs during decoding (Yin and Neubig, 2018; Guo et al., 2019b; Wang et al., 2020). However, neither type of parser has been evaluated against adversarial examples.

Adversarial Examples
Adversarial examples were first defined and investigated in computer vision, where they are generated by adding imperceptible noise to input images that leads to false predictions of machine learning models (Goodfellow et al., 2014). However, it is non-trivial to add such noise to text in natural language processing (NLP) tasks due to the discrete nature of language. Minor changes to characters or words may be perceptible to humans and may change meanings. To date, it is still difficult to reach an agreement on the definition of adversarial examples across tasks. Jia and Liang (2017); Belinkov and Bisk (2017); Ebrahimi et al. (2018); Miyato et al. (2016) add distracting sentences and sequences of random words into text, or randomly flip characters or words in input text; these perturbations confuse models without affecting the labels judged by humans, but they are not required to preserve the semantics of the original text. Michel et al. (2019) argue that perturbations for SEQ2SEQ tasks should minimize the change of semantics in input text but dramatically alter the meaning of outputs. Cheng et al. (2020) use a sentiment classifier to verify whether the sentiment of an original utterance is preserved after perturbation, whereas we use crowdsourcing to ensure that the meaning remains. Adversarial examples for semantic parsing cannot simply be borrowed from prior work: if a subtle change in input text alters its semantics, a parser that generates a different but correct LF makes no error, whereas adversarial examples are supposed to cause parsing errors. Thus, adversarial examples w.r.t. meaning-changing perturbations are not well defined.
There are two types of methods for generating adversarial examples. White-box methods (Papernot et al., 2016; Ebrahimi et al., 2017; Athalye and Carlini, 2018) assume that the attacker has direct access to model details, including parameters, while black-box methods assume that attackers have no access to model details beyond feeding inputs and observing outputs (Gao et al., 2018; Guo et al., 2019a; Chan et al., 2018; Blohm et al., 2018).
Adversarial Training Adversarial training aims to improve the robustness of machine learning models against adversarial examples (Goodfellow et al., 2014; Miyato et al., 2016; Li et al., 2018). One line of research augments the training data with adversarial examples. However, Ebrahimi et al. (2018) point out that adversarial training may make the model oversensitive to the adversarial examples. Another approach is to increase model capacity, which may improve a model's robustness. More techniques for adversarial defense can be found in the recent survey (Wang et al., 2019). In semantic parsing, data augmentation methods (Jia and Liang, 2016; Guo et al., 2018) have been proposed only to improve model performance on unperturbed examples, not to improve robustness.

Evaluation Framework for Robust Semantic Parsing
A robust semantic parser aims to transduce all utterances, with or without meaning-preserving perturbations, into correct LFs. Formally, let x = x_1, ..., x_{|x|} denote a natural language utterance and y = y_1, ..., y_{|y|} its LF; a semantic parser estimates the conditional probability p(y|x) of an LF y given an input utterance x. A robust parser's predictions are invariant to all x' that are generated from x by meaning-preserving perturbations.
For any (x, y), let S_perturb(x, y) denote the set of all meaning-preserving perturbations of x that are parsed to y:

S_perturb(x, y) = { x' ∈ B(x) | o(x') = y }    (1)

where B(x) denotes the set of all allowed perturbations of x, and o(x') is an ideal parser that maps an utterance to its LF.
A set of meaning-preserving perturbed examples S_perturb(D) w.r.t. a corpus D is the union of all S_perturb(x, y) created from D.
An adversarial example w.r.t. a semantic parser f is an utterance-LF pair in S_perturb(x, y) whose utterance is parsed into an incorrect LF by that parser:

A_f(x, y) = { (x', y) ∈ S_perturb(x, y) | f(x') ≠ y }    (2)
Subsequently, an adversary set w.r.t. a semantic parsing corpus D is created by taking the union of all adversary sets created from each example in D.
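These set constructions can be sketched in a few lines of Python. The oracle, parser, and candidate set below are toy stand-ins introduced only for illustration; the sketch mirrors the definitions of S_perturb and the adversary set above, not any implementation from the paper.

```python
def meaning_preserving_set(x, y, candidates, oracle):
    """S_perturb(x, y): perturbed utterances whose ideal parse is still y."""
    return [(xp, y) for xp in candidates if oracle(xp) == y]

def adversary_set(pairs, parser):
    """Adversaries: meaning-preserving pairs the parser gets wrong."""
    return [(xp, y) for xp, y in pairs if parser(xp) != y]

# Toy example: the "ideal parser" ignores the filler word "please",
# while the trained parser is confused by it.
oracle = lambda x: x.replace("please ", "")
parser = lambda x: x if "please" not in x else "WRONG_LF"

cands = ["please list rivers", "list lakes"]
s = meaning_preserving_set("list rivers", "list rivers", cands, oracle)
adv = adversary_set(s, parser)
```

Here "please list rivers" is meaning-preserving (the oracle still maps it to the original LF) and adversarial (the parser fails on it), while "list lakes" is meaning-changing and is excluded from S_perturb.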
In the following, we present an evaluation framework for robust semantic parsing, which consists of an evaluation corpus, a set of evaluation metrics, and the corresponding toolkit for evaluating any new parsers.

Construction of the Evaluation Corpus
We construct the evaluation corpus in a scalable manner by combining existing semantic parsing benchmark corpora. Each such corpus is referred to as a domain in the whole corpus. Each domain has a training set, a validation set, and a standard test set. We perturb the examples in each test set to build a meaning-preserving test set for each domain. More specifically, each example in a meaning-preserving test set is a perturbed utterance paired with its LF before perturbation. In the following, we detail each perturbation method and how we apply crowdsourcing to remove meaning-changing examples. Given an utterance, we perturb it by performing four different word-level operations and two sentence-level operations. Table 1 lists examples of generated meaning-preserving utterances categorized by the respective generation methods. More details are given below.

Meaning-Preserving Perturbations
Insertion Given an utterance x, we randomly select a position t ∈ {1, ..., |x|}. A meaning-preserving example x' is created by inserting a function word at position t.
Deletion We randomly remove a function word x_t from x.

Substitution (f) Every function word in x is replaced with a random but different function word to generate a perturbed example.
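The three function-word operations above can be sketched as follows. The function-word list here is a small illustrative assumption, since the paper does not enumerate the list it uses.

```python
import random

# Hypothetical, minimal function-word list (an assumption of this sketch).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "are", "what"}

def insert_function_word(tokens, rng):
    """Insertion: place a random function word at a random position."""
    t = rng.randrange(len(tokens) + 1)
    return tokens[:t] + [rng.choice(sorted(FUNCTION_WORDS))] + tokens[t:]

def delete_function_word(tokens, rng):
    """Deletion: remove one randomly chosen function word, if any."""
    idx = [i for i, w in enumerate(tokens) if w in FUNCTION_WORDS]
    if not idx:
        return list(tokens)
    t = rng.choice(idx)
    return tokens[:t] + tokens[t + 1:]

def substitute_function_words(tokens, rng):
    """Substitution (f): replace every function word with a different one."""
    return [rng.choice(sorted(FUNCTION_WORDS - {w})) if w in FUNCTION_WORDS
            else w for w in tokens]

rng = random.Random(0)
utt = "what is the capital of ohio".split()
ins = insert_function_word(utt, rng)
dele = delete_function_word(utt, rng)
sub = substitute_function_words(utt, rng)
```

Each operation changes at most function words, so content words such as "capital" and "ohio" are left intact.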
Substitution (nf) For every non-function word in x, we apply the pretrained language model ELECTRA (Clark et al., 2020) to select top-k candidate words and exclude the original word. We generate a perturbed utterance for each valid candidate word. Since this method may generate utterances far from their original meaning, we subsequently filter those utterances by measuring their semantic similarity with the original ones.
Specifically, we apply the SOTA sentence similarity model Sentence-BERT (Reimers and Gurevych, 2019) to compute a similarity score for each generated utterance and retain only the n highest-scoring utterances.
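The filtering step can be sketched with a pluggable sentence encoder. In the paper the encoder is Sentence-BERT; here a toy bag-of-words encoder keeps the sketch self-contained, so only the top-n selection logic reflects the method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_n_similar(original, candidates, encode, n):
    """Keep the n candidates closest to the original under `encode`
    (Sentence-BERT in the paper; injected here as a parameter)."""
    ref = encode(original)
    ranked = sorted(candidates, key=lambda c: cosine(encode(c), ref),
                    reverse=True)
    return ranked[:n]

# Toy bag-of-words encoder standing in for Sentence-BERT.
VOCAB = ["major", "main", "river", "ohio", "lake"]
def bow(s):
    words = s.split()
    return [float(words.count(w)) for w in VOCAB]

kept = top_n_similar("major river in ohio",
                     ["main river in ohio", "lake in ohio"], bow, 1)
```

With the toy encoder, the near-paraphrase "main river in ohio" outranks the meaning-changing "lake in ohio" and survives the filter.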
Back Translation Inspired by Lichtarge et al. (2019), we revise utterances using back translation. We apply the Google translation API to translate utterances into a bridge language and then back to English. Russian, French, and Japanese are used as bridge languages to diversify and maximize the coverage of meaning-preserving perturbations. For each original utterance, we select the best among the three back-translations according to Sentence-BERT scores.
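The round-trip logic can be sketched as below. Here `translate(text, src, tgt)` is an assumed helper wrapping a translation API (such as the Google translation API), and `score` compares a paraphrase with the original (Sentence-BERT in the paper); the toy stand-ins exist only to make the sketch runnable.

```python
def back_translate(text, translate, bridge_langs, score):
    """Round-trip `text` through each bridge language and keep the
    paraphrase scored most similar to the original."""
    paraphrases = [translate(translate(text, "en", b), b, "en")
                   for b in bridge_langs]
    return max(paraphrases, key=lambda p: score(text, p))

# Toy stand-in: the "bridge" step already yields the final paraphrase,
# and translating back to English is the identity.
def translate(text, src, tgt):
    if tgt == "en":
        return text
    return {"ru": "list jobs using sql", "fr": "what work uses sql"}[tgt]

def overlap(a, b):
    """Crude similarity: shared word count, standing in for Sentence-BERT."""
    return len(set(a.split()) & set(b.split()))

best = back_translate("list job using sql", translate, ["ru", "fr"], overlap)
```

The Russian round trip stays closer to the original wording, so it is selected over the freer French paraphrase.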
Reordering Similar to back translation, we reorder utterances with the SOTA reordering model SOW-REAP (Goyal and Durrett, 2020). To increase coverage, we follow the same strategy as Substitution (nf) to generate an extended set of reordered utterances with multiple instances per input utterance. We use Sentence-BERT to encode sentences and select the top-k according to their cosine similarity scores with the original input. A reordered sentence shares the vocabulary of the original; only the word order differs.

Filter Examples by Crowdsourcing
Since perturbation operations on function words rarely alter meanings, we apply crowdsourcing only to utterances perturbed by Substitution (nf), Reordering, and Back Translation. For each perturbed utterance paired with its original, three workers at Amazon Mechanical Turk judge semantic changes by choosing one of three options: the same, different, or not sure. By default, any sentence a worker cannot comprehend is labelled not sure. Finally, we keep only the utterances judged to have the same meaning by at least two workers. After crowdsourcing, we keep 83%, 82%, and 61% of the generated utterances for Substitution (nf), Back Translation, and Reordering, respectively.
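The agreement rule boils down to a majority vote over the three labels, which can be sketched as:

```python
from collections import Counter

def keep_meaning_preserving(judgements, min_agree=2):
    """Keep a perturbed utterance only if at least `min_agree` of the
    workers judged it to have 'the same' meaning as the original."""
    return Counter(judgements)["same"] >= min_agree

a = keep_meaning_preserving(["same", "same", "different"])   # kept
b = keep_meaning_preserving(["same", "not sure", "different"])  # discarded
```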

Evaluation Metrics
Our framework assesses the performance of a semantic parser w.r.t. standard accuracy and robustness metrics. The robustness metrics indicate how well a semantic parser resists meaning-preserving perturbations and adversarial attacks. As adversarial training is widely used for adversarial defense and is applicable to most neural semantic parsers, the framework also supports comparing a wide range of adversarial training methods w.r.t. these metrics.
Each domain provides a training set, a validation set, a standard test set, and a meaning-preserving test set. The first three are obtained from the original benchmark corpus; the meaning-preserving test set is created using the methods described in Sec. 3.1. We examine the meaning-preserving test set (denoted by S_perturb(D)) and two of its subsets. The first subset, the robustness evaluation set (denoted by R_eval(D)), contains the examples whose counterparts before perturbation are parsed correctly. The second subset, the black-box test set (denoted by B_attack(D)), contains the examples on which a parser's loss is higher than on their counterparts before perturbation. For each target parser, we consider four metrics: standard accuracy, perturbation accuracy, robust accuracy, and success rate of black-box attack.
Standard accuracy The most widely used metric in semantic parsing (Dong and Lapata, 2018; Yin and Neubig, 2018), it measures the percentage of predicted LFs that exactly match their gold LFs on a standard test set.
Perturbation accuracy Perturbation accuracy is defined as n/|S_perturb(D)|, where n denotes the number of examples in the meaning-preserving test set S_perturb(D) that are correctly parsed to their gold LFs.
Robust accuracy Robust accuracy is calculated as n/|R_eval(D)|, where n denotes the number of examples in the robustness evaluation set R_eval(D) that are parsed correctly. In contrast to perturbation accuracy, robust accuracy considers only the examples a parser gets right before perturbation, so it isolates the errors caused by the perturbations themselves.
Success rate of black-box attack A black-box attack example is one that increases the loss of a model after perturbation. The success rate of black-box attack is calculated as n/|B_attack(D)|, where n denotes the number of examples in the black-box test set B_attack(D) that are parsed incorrectly. White-box attacks require model-specific implementations to generate adversarial examples, so we leave the corresponding evaluation to the developers of semantic parsers.
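The accuracy-based metrics can be sketched as follows. The black-box success rate is omitted because it requires access to the model's loss; the toy dictionary-lookup parser below is used only to make the sketch runnable.

```python
def evaluate(parser, standard, perturbed):
    """Compute standard, perturbation, and robust accuracy.
    `standard` holds (x, y) pairs; `perturbed` holds (x, x_pert, y)
    triples pairing each perturbed utterance with its original
    utterance and the gold LF of the original."""
    std_acc = sum(parser(x) == y for x, y in standard) / len(standard)
    pert_acc = sum(parser(xp) == y for _, xp, y in perturbed) / len(perturbed)
    # R_eval(D): keep only examples parsed correctly before perturbation.
    r_eval = [(xp, y) for x, xp, y in perturbed if parser(x) == y]
    rob_acc = (sum(parser(xp) == y for xp, y in r_eval) / len(r_eval)
               if r_eval else 0.0)
    return std_acc, pert_acc, rob_acc

# Toy parser: a lookup table over two originals and their perturbations.
pred = {"a": "A", "b": "X", "a'": "A", "b'": "Z"}.get
std, pert, rob = evaluate(pred, [("a", "A"), ("b", "B")],
                          [("a", "a'", "A"), ("b", "b'", "B")])
```

In this toy run the parser gets "b" wrong even before perturbation, so "b'" is excluded from R_eval and robust accuracy is higher than perturbation accuracy.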
The four metrics are also computed to evaluate the efficacy of an adversarial training method: we inspect whether each metric increases or decreases after training, and to what degree. An effective adversarial training method is expected to find a good trade-off between standard accuracy and robust accuracy.
Last but not least, all evaluation metrics are implemented with easy-to-use APIs in our toolkit. Our toolkit supports easy evaluation of a semantic parser and provides source code to facilitate integrating additional semantic parsing corpora.

Experiments
In this section, we present the first empirical study on robust semantic parsing.

Experimental Setup
Parsers We consider three SOTA neural semantic parsers -SEQ2SEQ with attention (Luong et al., 2015), COARSE2FINE (Dong and Lapata, 2018), and TRANX (Yin and Neubig, 2018). COARSE2FINE is the best performing semantic parser on the standard splits of GEOQUERY and ATIS. TRANX reports standard accuracy on par with COARSE2FINE and employs grammar rules to ensure validity of outputs.
Datasets Our evaluation corpus is constructed by extending three benchmark corpora: GEOQUERY (Zelle and Mooney, 1996), ATIS (Dahl et al., 1994), and JOBS. GEOQUERY contains 600 training and 280 test utterance-LF pairs about geography. ATIS consists of 4434, 491, and 448 examples about flight booking in the training, validation, and test sets, respectively. JOBS includes 500 training and 140 test pairs about job listings.
Adversarial Training Methods We apply three adversarial training methods to the parsers and evaluate whether they improve the parsers' robustness against the meaning-preserving examples generated by the word-level and sentence-level operations. The three methods are as follows:

Fast Gradient Method (Miyato et al., 2016) The Fast Gradient Method (FGM) adds small perturbations to the word embeddings and trains the semantic parsers with the perturbed embeddings. The perturbations are scaled gradients w.r.t. the input word embeddings.
Projected Gradient Descent (Madry et al., 2017) Projected Gradient Descent (PGD) also adds small perturbations to the word embeddings. Instead of taking a single gradient step, PGD accumulates scaled gradients over multiple iterations to generate the perturbations.
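The two perturbation rules can be illustrated on a toy model. The sketch below uses a two-dimensional "embedding" classified by logistic regression (an assumption for self-containment; in the paper the perturbations are applied to the parsers' word embeddings during training), with analytic gradients so no deep learning framework is needed.

```python
import numpy as np

def loss_and_grad(v, w, label):
    """Logistic loss of embedding v under weights w, and its gradient w.r.t. v."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, v)))
    loss = -np.log(p) if label == 1 else -np.log(1.0 - p)
    grad = (p - label) * w
    return loss, grad

def fgm(v, w, label, eps=0.5):
    """FGM: one step of size eps along the normalized loss gradient."""
    _, g = loss_and_grad(v, w, label)
    return v + eps * g / (np.linalg.norm(g) + 1e-12)

def pgd(v, w, label, eps=0.5, alpha=0.2, steps=5):
    """PGD: repeated small steps, projected back onto the eps-ball around v."""
    adv = v.copy()
    for _ in range(steps):
        _, g = loss_and_grad(adv, w, label)
        adv = adv + alpha * g / (np.linalg.norm(g) + 1e-12)
        adv = v + np.clip(adv - v, -eps, eps)  # projection step
    return adv

v = np.array([1.0, -1.0])
w = np.array([2.0, 1.0])
base_loss, _ = loss_and_grad(v, w, 1)
fgm_loss, _ = loss_and_grad(fgm(v, w, 1), w, 1)
pgd_loss, _ = loss_and_grad(pgd(v, w, 1), w, 1)
```

Both perturbations increase the loss relative to the clean embedding; in adversarial training, the model would then be updated on these perturbed embeddings.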

Meaning-preserving Data Augmentation
Meaning-preserving Data Augmentation (MDA) augments the original training data with the meaning-preserving examples generated by the word-level and sentence-level operations. We randomly select 20% of the original instances for each dataset and generate their corresponding meaning-preserving instances. Since there are six different operations, each original training set is augmented with six different meaning-preserving sets independently.
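The augmentation step of MDA can be sketched as below, assuming a generic `perturb` function standing in for any one of the six operations; the 20% sampling fraction follows the paper, while the helper names are illustrative.

```python
import random

def augment_with_perturbations(train, perturb, fraction=0.2, seed=0):
    """MDA sketch: perturb a random fraction of (utterance, LF) pairs and
    append the perturbed utterances, paired with the ORIGINAL gold LFs."""
    rng = random.Random(seed)
    k = max(1, int(len(train) * fraction))
    chosen = rng.sample(train, k)
    return train + [(perturb(x), y) for x, y in chosen]

train = [("list rivers", "LF1"), ("list lakes", "LF2"),
         ("largest state", "LF3"), ("smallest city", "LF4"),
         ("list jobs", "LF5")]
# Toy perturbation standing in for one of the six operations.
aug = augment_with_perturbations(train, lambda x: "please " + x)
```

Because the perturbations are meaning-preserving, each augmented utterance keeps the gold LF of its source example; running this once per operation yields the six independent augmented sets described above.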
Training Details For both supervised and adversarial training, we set the batch size to 20 for all parsers and train for 100 epochs. We follow the best settings reported in (Dong and Lapata, 2018; Yin and Neubig, 2018) for the remaining hyperparameters. The best-performing implementation of SEQ2SEQ is the one included in Yin and Neubig (2018).

Results and Analysis
We discuss the experimental results by addressing the following research questions.

RQ1: How do the SOTA parsers perform on meaning-preserving test sets? All three SOTA semantic parsers are trained on each training set and then tested on the corresponding standard and meaning-preserving test sets. Besides the overall results on whole test sets, Table 2 reports accuracy on each example subset perturbed by the respective perturbation operation. The results on those subsets are further compared with the standard accuracy on the corresponding examples before perturbation.
As shown in Table 2, the SOTA semantic parsers suffer a significant performance drop on almost all meaning-preserving test sets compared to the results on standard test sets. The performance ranking among the three parsers varies across datasets. COARSE2FINE achieves the best performance on GEOQUERY and ATIS, while SEQ2SEQ beats COARSE2FINE and TRANX on JOBS. Although COARSE2FINE achieves better accuracy than TRANX on the standard test set of JOBS, it falls short of TRANX on the meaning-preserving test set of JOBS. A parser achieving higher standard accuracy does not necessarily obtain better perturbation accuracy against meaning-preserving perturbations than its competitors.
Our evaluation framework also supports in-depth analysis of the impact of different perturbation methods on semantic parsers. All parsers are more vulnerable to sentence-level perturbations than word-level ones. Although reordering changes only the word order of utterances, it leads to the lowest perturbation accuracy for all parsers on GEOQUERY and ATIS. Among word-level operations, substitution of non-function words is more challenging than deletion and insertion of function words on GEOQUERY and ATIS. On JOBS and ATIS, even deletion or substitution of function words imposes significant challenges for the parsers. Inspecting the deleted or replaced function words (see Table 1), we find that they carry little semantic information, so semantic parsers should not rely on them. As none of the perturbations in our meaning-preserving test sets change the meanings of the original utterances, this poses new research challenges: making semantic parsers resist meaning-preserving perturbations and avoiding overfitting on semantically insignificant words.
RQ2: What kind of perturbed examples particularly degrade parser performance? Although perturbation accuracy allows comparing parsers on the same test sets, it includes the examples that a parser fails to parse correctly both before and after the meaning-preserving perturbation. Robust accuracy focuses on the examples a parser parses successfully before perturbation but fails on afterwards. We investigate all parsers trained without adversarial training in terms of this measure. As shown in Table 3, TRANX is superior to the other two parsers on JOBS, and COARSE2FINE is the clear winner on GEOQUERY. This ranking is consistent with perturbation accuracy in the two domains. However, the differences among the three parsers on ATIS are marginal, even though COARSE2FINE achieves significantly superior perturbation accuracy there. Since ATIS is the domain with the most diverse natural language paraphrases, COARSE2FINE cannot significantly outperform the other two parsers against meaning-preserving perturbations.
We further investigate the adversarial examples, as defined in Eq. 2, for each parser in the meaning-preserving test sets. The shared adversarial examples among the parsers vary significantly across domains: more than 50% of the adversarial examples are shared among the parsers on ATIS, but fewer than 25% on JOBS. Fig. 1 illustrates which perturbation operations contribute to the intersection of the parsers' adversary sets. Back translation contributes almost half of the shared perturbed examples on JOBS, while the proportion of word-level operations reaches nearly 70% on ATIS. Although reordering and back translation pose the most difficult challenges for parsers, they cannot generate a significant number of valid adversarial examples. In contrast, a word substitution attack is relatively easy to resist, but the operation can easily be applied to generate a large number of adversaries.

[Table 4: representative adversarial examples shared by all parsers, showing the original and adversarial utterances, the gold LF, and the predictions of SEQ2SEQ, COARSE2FINE, and TRANX for Deletion (f), Substitution (nf), Back Translation, and Reordering perturbations.]

We hand-picked representative adversarial examples shared by all parsers and list them in Table 4. The errors made by substitution (f) and insertion often show a sign of overfitting: the parsers learn dependencies between non-essential features and predicates. Substitution (nf) sometimes causes predicates to go missing after a word is replaced with its synonym. The errors caused by deletion, back translation, and reordering are more diverse, including adding wrong predicates or missing predicates.

RQ3: How does adversarial training affect standard accuracy? Prior work suggests it is challenging to achieve a good trade-off between standard accuracy and robust accuracy in most occasions. This finding is attributed to the presence of non-robust features, which are highly predictive but incomprehensible to humans (Ilyas et al., 2019).
To verify the theory, we compare each parser's standard accuracy before and after applying each adversarial training method. As shown in Table 5, none of the three adversarial training methods consistently improves standard accuracy across domains and parsers. TRANX shows no significant performance drop regardless of which adversarial training method is applied, which may be because TRANX uses a grammar to filter out invalid outputs.

RQ4: Does MDA affect standard accuracy? MDA cannot consistently improve parsers, but it also does not hurt their performance in terms of standard accuracy. We conducted t-tests on the standard test sets to assess whether MDA with different perturbation operations significantly changes accuracy. The results are negative, so the training examples generated by MDA at least do not hurt the parsers' performance while increasing their robustness.
RQ5: How does our meaning-preserving data augmentation method compare with the data augmentation method of Jia and Liang (2016)? Jia and Liang (2016) propose one of the SOTA data augmentation methods for semantic parsing and show that the augmented examples improve accuracy in predicting denotations. Although three methods are proposed in their work, only the method of concatenating two random examples into a new example is applicable in our setting. We evaluated this augmentation method with all three parsers and found no significant improvement in standard or robust accuracy in any of the three domains. We conjecture that this is because Jia and Liang (2016) report improvements only in denotation matching, not in LF matching. In contrast, MDA can effectively reduce the harm of meaning-preserving perturbations.

Conclusion
We conducted the first empirical study on the robustness of neural semantic parsers. To evaluate robust accuracy, we first defined adversarial examples for semantic parsing and then constructed robustness test sets with a scalable method. The outcome of this work is intended to facilitate semantic parsing research by providing a benchmark for evaluating both standard accuracy and robustness metrics.