A Closer Look into the Robustness of Neural Dependency Parsers Using Better Adversarial Examples

Previous work on adversarial attacks on dependency parsers has mostly focused on attack methods, as opposed to the quality of adversarial examples, which in previous work has been relatively low. To address this gap, we propose a method to generate high-quality adversarial examples with a higher number of candidate generators and stricter filters, and then verify their quality using automatic and human evaluations. We perform analysis with different parsing models and observe that: (i) injecting words not used in the training stage is an effective attack strategy; (ii) adversarial examples generated against a parser strongly depend on the parser model, the token embeddings, and even the specific instantiation of the model (i.e., a random seed). We use these insights to improve the robustness of English parsing models, relying on adversarial training and model ensembling (footnote 1).


Introduction
Neural network-based models have achieved great successes in a wide range of NLP tasks. However, recent work has shown that their performance can be easily undermined with adversarial examples that would pose no confusion for humans. As an increasing number of successful adversarial attackers have been developed for NLP tasks, the quality of the adversarial examples they generate has been questioned (Morris et al., 2020).
The definition of a valid successful adversarial example differs across target tasks. In semantic tasks such as sentiment analysis (Zhang et al., 2019) and textual entailment (Jin et al., 2020), a valid successful adversarial example needs to be able to alter the prediction of the target model while preserving the semantic content and fluency of the original text. In contrast, in the less explored field of attacking syntactic tasks, the syntactic structure, rather than the semantic content, must be preserved, while fluency must also be maintained. Preserving the syntactic structure enables us to use the gold syntactic structure of the original sentence in the evaluation process, while preserving fluency ensures that ungrammatical adversarial examples, which not only fool the target model but also confuse humans, are not considered valid. Therefore, in this paper, we evaluate the quality of an adversarial example in two aspects, namely fluency and syntactic structure preservation.
* Work partially done while at the University of Edinburgh.
Footnote 1: Our code is available at https://github.com/WangYuxuan93/DepAttacker.git
Recently, Zheng et al. (2020) proposed the first dependency parser attacking algorithm based on word substitution, which depended entirely on BERT (Devlin et al., 2019) to generate candidate substitutes. The rationale was that the use of the pre-trained language model would ensure the fluency of the adversarial examples. However, we find that using BERT alone is far from enough to preserve fluency. Therefore, in this paper, we propose a method to generate better adversarial examples for dependency parsing with four types of candidate generators and filters. Specifically, our method consists of three steps: (i) determining the substitution order, (ii) generating and filtering candidate substitutes for each word, (iii) searching for the best possible combination of substitutions, based on pre-computed candidates and the substitution order. We verify the superiority of the proposed method in terms of syntactic structure preservation and fluency using both automatic and human evaluations, and further show the limitation of the previous BERT-based method. Table 1 shows adversarial examples generated by our method and by the method of Zheng et al. (2020).

Table 1: Adversarial examples generated by our method and by the method of Zheng et al. (2020). Each substituted word is written as original → substitute.
Boeing received a $ 46 million Air Force → America contract for developing → securing cable systems for the Minuteman Missile.
example-3: He used better than 5,000 words heaping scorn on the witnesses → eyewitnesses for exercising the Fifth.
example-3: He used better than 5,000 words → times heaping scorn on the witnesses → dollars for exercising the Fifth → grand.

With the proposed attacking method, we evaluate the robustness of different parsing models and analyse the properties of adversarial attacks. We find that (i) out-of-vocabulary (OOV, words not in the embedding's vocabulary) and out-of-training (OOT, words not in the training set of the parser) words introduced in adversarial examples are two main factors that harm models' performance; (ii) adversarial examples generated against a parser strongly depend on the type of the parser, the token embeddings and even the random seed.
Adversarial training (Goodfellow et al., 2015), where adversarial examples are added in the training stage, has been commonly used in previous work (Zheng et al., 2020; Han et al., 2020) to improve a parser's robustness. Only a limited number of adversarial examples have been used in such cases, and Zheng et al. (2020) argued that overuse of them may lead to a performance drop on the clean data. However, we show that with the improvement in the quality of adversarial examples produced by our method, more adversarial examples can be used in the training stage to further improve the parsing models' robustness without producing any apparent harm to their performance on the clean data. Inspired by our second finding, we propose to improve the parsers' robustness by combining models trained with different random seeds and embeddings. Such methods, which are not targeting specific types of attacks, should improve the capacity to defend against new attacks as compared to standard adversarial training.

Method
In this section, we first give a formal definition of a dependency parsing attack. Then we describe the proposed attacking method for dependency parsing, shown in Algorithm 1. It consists of three steps, namely ranking word importance (lines 1-4), generating candidates for substitution (line 7) and searching for the best substitute combination (lines 8-21).

Problem Definition
Given an input text space X containing all possible input sentences x and an output space Y containing all possible dependency trees of x, a parser F : X → Y learns to map the sentence x to its corresponding tree y, denoted by F(x) = y. The i-th word of x is denoted by x_i. For sentence x, a valid adversarial example x* is crafted by adding a perturbation to x so that the parser's prediction changes, i.e. F(x*) ≠ y, subject to a constraint function σ, which ensures that i) the perturbation is imperceptible and ii) the true dependency tree of x* is the same as that of x. In this paper, these two constraints are ensured through the use of various filters (see Section 2.3) and are used to evaluate the quality of adversarial examples (see details on fluency and syntactic structure preservation in Section 3.3).

Word Importance Ranking
Word importance ranking in our model is based on the observation that some words have a stronger influence on model prediction than others. Such word importance is typically computed by setting each word to unknown and examining the resulting changes in the model's predictions (Ren et al., 2019). This helps to determine the word substitution order in the proposed method.

Algorithm 1 Dependency Parsing Attack
Input: Sentence example x^(0) = {x_1, x_2, ..., x_N}, maximum percentage of words allowed to be modified γ
Output: Adversarial example x^(i)
1: for i = 1 to N do
2:   Compute word importance I(x^(0), x_i) via Eq. 1
3: end for
4: Create a set W of all words x_i ∈ x^(0) sorted by the descending order of their importance I(x^(0), x_i).
5: t = 0
6: for each word x_j in W do
7:   Build candidate set C_j for x_j following the Candidate Substitute Generating step
8:   Initialise valid candidate set VC ← {}
9:   for each candidate c_k in C_j do
10:    Compute the accuracy change S(x^(t), c_k, j) via Eq. 3
11:    if S(x^(t), c_k, j) ≤ 0 then continue end if
12:    Add c_k to the set VC
13:  end for
14:  if VC is not empty then
15:  end if
20: end for
21: if t > 0 then return x^(t) else return None end if
In this work, we use a combination of the changes in the unlabelled attachment score (UAS) and in the labelled attachment score (LAS) to measure word importance. Specifically, the importance I(x, x_i) of a word x_i in sentence x is computed (Eq. 1) from ΔUAS and ΔLAS, the changes in UAS and LAS, respectively, observed when x_i is set to unknown. λ_arc is a coefficient that controls the relative importance of dependency arcs and their labels.
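The ranking step can be sketched as follows. This is only an illustration under stated assumptions: the exact form of Eq. 1 is not reproduced in this text, so the convex combination λ_arc · ΔUAS + (1 − λ_arc) · ΔLAS used below is an assumed reading of "a combination ... controlled by λ_arc". The `parse` callable, the `<unk>` token and all function names are hypothetical stand-ins for a real parser.

```python
from typing import Callable, List, Sequence, Tuple

# A parse is one (head_index, label) pair per token.
Parse = List[Tuple[int, str]]


def attachment_scores(pred: Parse, gold: Parse) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence as fractions in [0, 1]."""
    n = len(gold)
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / n
    las = sum(p == g for p, g in zip(pred, gold)) / n
    return uas, las


def word_importance(
    tokens: Sequence[str],
    gold: Parse,
    parse: Callable[[List[str]], Parse],  # hypothetical parser interface
    lambda_arc: float = 0.5,
    unk: str = "<unk>",                   # assumed unknown-word symbol
) -> List[float]:
    """Score each word by the accuracy lost when it is set to unknown."""
    base_uas, base_las = attachment_scores(parse(list(tokens)), gold)
    scores = []
    for i in range(len(tokens)):
        masked = list(tokens)
        masked[i] = unk
        uas, las = attachment_scores(parse(masked), gold)
        d_uas = base_uas - uas
        d_las = base_las - las
        # Assumed form of Eq. 1: weighted combination of the two drops.
        scores.append(lambda_arc * d_uas + (1 - lambda_arc) * d_las)
    return scores


def substitution_order(scores: Sequence[float]) -> List[int]:
    """Word indices sorted by descending importance (ties keep sentence order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

With λ_arc = 0.5, the value used in the experiments, arcs and labels contribute equally to the importance score under this assumed form.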

Generation of Substitute Candidates
Generating substitute candidates is a critical step, as it significantly influences the attack success rate and the quality of generated adversarial examples. Zheng et al. (2020) relied entirely on BERT to generate candidates, but this limits the quality of the adversarial examples. To alleviate this problem, we first collect candidate substitutes from four generation methods, then apply filters to discard inappropriate substitutes, ensuring both diversity and quality of the generated candidates.

Generating Process
We collect substitutes from the following methods:
BERT-Based Method: We use BERT to generate candidates for each target word from its context. This method generates only single subwords.
Embedding-Based Method: Following Alzantot et al. (2018), we use the word embeddings of Mrkšić et al. (2016) to compute the N nearest neighbours of each target word according to their cosine similarity and use them as candidates.
Sememe-Based Method: The sememes of a word represent its core meaning (Dong and Dong, 2006). Following Zang et al. (2020), we collect the substitutes of the target word x based on the rule that one of the senses of a substitute x* must have the same sememe annotations as one of the senses of x.
Synonym-Based Method: We use WordNet to extract synonyms of each target word as candidates.
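As an illustration of the embedding-based generator and of pooling several generators, the sketch below finds the N nearest neighbours by cosine similarity over a toy embedding table and merges the outputs of arbitrary generators. The embedding table, the function names and the generator interface are assumptions made for this example; a real setup would load the embeddings of Mrkšić et al. (2016) and plug in the BERT-, sememe- and synonym-based generators.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(
    word: str, embeddings: Dict[str, Sequence[float]], n: int = 5
) -> List[str]:
    """Return the n words closest to `word` by cosine similarity."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:n]]


def collect_candidates(
    word: str, generators: Sequence[Callable[[str], List[str]]]
) -> List[str]:
    """Pool candidates from several generators, dropping duplicates."""
    pooled: List[str] = []
    for generate in generators:
        for cand in generate(word):
            if cand != word and cand not in pooled:
                pooled.append(cand)
    return pooled
```

Pooling before filtering keeps the generators independent of each other, so a weak generator only adds candidates that the later filters can discard.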

Filtering Process
We apply the following four types of filters to discard candidates which are likely inappropriate, either in terms of syntactic preservation or fluency.
POS Filter: We first filter out substitutes with different part-of-speech (POS) tags from the original word. This filter is essential for preserving the syntactic structure of the sentence.
Word Embedding Similarity Filter: We use the word embeddings of Mrkšić et al. (2016) to compute the cosine similarity between the original word and each of the substitutes in C and filter out those whose similarities are less than a threshold w.
Grammar Checker Filter: We employ an off-the-shelf grammar checker to filter out candidates that may introduce grammar errors. This filter helps to further ensure that the syntactic structure and fluency are preserved.
Perplexity Filter: We employ GPT-2 (Radford et al., 2019) to calculate the perplexity difference between x and x_i^c for each candidate c:
$$\Delta\mathrm{ppl}(x, c, i) = \mathrm{ppl}(x_i^c) - \mathrm{ppl}(x),$$
where x_i^c is x with its i-th word replaced by c, and filter out every c for which ∆ppl(x, c, i) > p.
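The four filters can be chained as in the following sketch. The taggers, similarity function, grammar checker and perplexity scorer are passed in as callables because the concrete tools are not specified here; their signatures, and the function name `filter_candidates`, are assumptions for illustration only. The default thresholds are the values reported in the experimental settings (w = 0.7, p = 20.0).

```python
from typing import Callable, List, Optional, Sequence


def filter_candidates(
    tokens: Sequence[str],
    i: int,
    candidates: Sequence[str],
    pos_tag: Callable[[Sequence[str], int], str],      # hypothetical tagger
    similarity: Callable[[str, str], float],           # hypothetical cosine lookup
    perplexity: Callable[[Sequence[str]], float],      # hypothetical LM scorer
    grammar_ok: Optional[Callable[[Sequence[str]], bool]] = None,
    sim_threshold: float = 0.7,
    ppl_threshold: float = 20.0,
) -> List[str]:
    """Keep candidates for position i that pass all filters."""
    original = tokens[i]
    original_pos = pos_tag(tokens, i)
    base_ppl = perplexity(tokens)
    kept = []
    for cand in candidates:
        replaced = list(tokens[:i]) + [cand] + list(tokens[i + 1:])
        # POS filter: the substitute must keep the original POS tag.
        if pos_tag(replaced, i) != original_pos:
            continue
        # Similarity filter: drop substitutes below the threshold.
        if similarity(original, cand) < sim_threshold:
            continue
        # Grammar filter (optional in this sketch).
        if grammar_ok is not None and not grammar_ok(replaced):
            continue
        # Perplexity filter: drop if perplexity rises by more than the threshold.
        if perplexity(replaced) - base_ppl > ppl_threshold:
            continue
        kept.append(cand)
    return kept
```

Ordering the cheap checks first means the perplexity model is only called for candidates that already pass the POS, similarity and grammar checks.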

Best Substitute Searching
In this step, we greedily search for the best possible combination of substitutions, relying both on the previously created candidate lists and on the word substitution order. To preserve the syntactic structure of sentences, we forbid the replacement of pronouns, articles, conjunctions, numerals, interjections, interrogative determiners and punctuation. Additionally, we set a maximum percentage γ of words allowed to be modified in the experiments to limit the number of modifications.
Specifically, given a sentence x, we substitute the words following the order computed in the word importance ranking step. For each target word x_i, we build an adversarial example x_i^c = x_1 x_2 ... c ... x_N for each of its substitutes c. Then we compute the accuracy change score S(x, c, i) (Eq. 3) obtained when x_i^c, instead of x, is given as input to the parser; the score is based on ΔUAS and ΔLAS, the changes in UAS and LAS, respectively. If the percentage of modified words in the sentence exceeds the threshold γ, we stop the process. Otherwise, we search for a substitute for the next target word.
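A minimal sketch of the greedy search is given below. Two points are assumptions rather than statements of the method: the selection rule (taking the valid candidate with the largest accuracy change) fills in the steps of Algorithm 1 that are not reproduced in this text, and the budget check stops before a substitution would push the modified fraction above γ. The callables `candidates_for` and `score_change` are hypothetical interfaces standing in for the candidate generation step and Eq. 3.

```python
from typing import Callable, List, Optional, Sequence


def greedy_attack(
    tokens: Sequence[str],
    order: Sequence[int],
    candidates_for: Callable[[List[str], int], List[str]],  # filtered candidates
    score_change: Callable[[List[str], str, int], float],   # stands in for Eq. 3
    gamma: float = 0.15,
) -> Optional[List[str]]:
    """Greedily substitute words in importance order within a budget."""
    current = list(tokens)
    modified = 0
    for j in order:
        # Stop once another substitution would exceed the budget gamma.
        if (modified + 1) / len(tokens) > gamma + 1e-9:
            break
        best, best_score = None, 0.0
        for cand in candidates_for(current, j):
            s = score_change(current, cand, j)
            # Only candidates that reduce accuracy (S > 0) are valid.
            if s > best_score:
                best, best_score = cand, s
        if best is not None:
            current[j] = best  # assumed rule: keep the most damaging candidate
            modified += 1
    return current if modified > 0 else None
```

Returning `None` when no substitution is made mirrors the last line of Algorithm 1, where an attack without any accepted substitution yields no adversarial example.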

Target Parsers and Token Embeddings
We choose the following two strong and commonly used English parsers, one graph-based, the other transition-based, as target models, both of which achieve performance close to the state-of-the-art.
Deep Biaffine Parser (Dozat and Manning, 2017) is a graph-based parser that scores each candidate arc independently and relies on a decoding algorithm to search for the highest-scoring tree.
Stack-Pointer Parser (Ma et al., 2018) is a transition-based parser that incrementally builds the dependency tree with pre-defined operations.
We use the following four types of token embeddings to study their influence on each parser's robustness. To focus on the influence of the embeddings, we use only the embeddings as input to the parsers:
GloVe (Pennington et al., 2014) is a frequently used static word embedding.
RoBERTa (Liu et al., 2019) is a pre-trained language model based on a masked language modelling objective, which learns to predict a randomly masked token based on its context. It produces contextualised word piece embeddings.
ELECTRA (Clark et al., 2020) is a pre-trained language model based on a replaced token detection objective, which learns to predict whether each token in the corrupted input has been replaced. It produces contextualised word piece embeddings.
ELMo (Peters et al., 2018) is a pre-trained language representation model based on character embeddings and bidirectional language modelling.

Datasets and Experimental Settings
We train the target parsers and evaluate the proposed method on the English Penn Treebank (PTB) dataset (footnote 7), converted into Stanford dependencies using version 3.3.0 of the Stanford dependency converter (de Marneffe et al., 2006) (PTB-SD-3.3.0). We follow the standard PTB split, using sections 2-21 for training, section 22 as a development set and section 23 as a test set.
It is important to note that when converting PTB into Stanford dependencies, Zheng et al. (2020) maintained the copula (linking verb) as a head when its complement was an adjective or noun (footnote 8). However, since the design objective of Stanford dependencies is to maximize dependencies between content words (de Marneffe et al., 2006), a more typical setting is to regard copulas as auxiliary modifiers. Therefore, we first compare with the previous method by performing this step under their settings and further conduct experiments with the typical PTB-SD-3.3.0 dataset for the convenience of follow-up research.
While training the target parsers, we adopt the hyper-parameters from their respective papers. Note that to compare with the biaffine parser, which uses first-order features, we also adopt the basic setting for the stack-pointer parser (footnote 9). When using RoBERTa, ELECTRA or ELMo embeddings as input, we set the learning rate of these pre-trained models to 2e-5 and that of other parameters to 2e-2.
Footnote 7: https://catalog.ldc.upenn.edu/LDC99T42
Footnote 8: Referred to as PTB-SD-3.3.0-COP in the rest of the paper.
Footnote 9: According to our preliminary experiments, neither second-order features nor beam search has an obvious influence on the parser robustness under our attack.
For the hyper-parameters of each attacking method, we set the word embedding similarity threshold w = 0.7, the candidate perplexity difference threshold p = 20.0, the arc importance coefficient λ_arc = 0.5 and the maximum percentage of words allowed to be modified γ = 15%.

Evaluation Metrics
As introduced in Section 2.1, two constraints should be satisfied for an adversarial example to be valid: i) the perturbation is imperceptible, ii) the true dependency tree of x * should be the same as that of x. For the first, we use fluency to measure the imperceptibility of the perturbations, and assume that in a fluent adversarial example the perturbation is imperceptible. For the second, syntactic structure preservation is used to measure whether an adversarial example's true dependency tree is identical to that of the original text. Both automatic and human evaluations are used for analysis.
In the automatic evaluation, GPT-2 (Radford et al., 2019) is used to compute the average perplexity of the adversarially modified PTB test set to measure the overall fluency. In the human evaluation, we ask three annotators to evaluate the quality of adversarial examples in two aspects, namely syntactic structure preservation and fluency (footnote 10). To evaluate the preservation of the syntactic structure, we randomly collect 100 sentences along with their adversarial examples and ask the annotators to decide whether the syntactic structure is preserved in each case. For the fluency evaluation, we randomly collect 100 sentences along with the adversarial examples generated by our method and those produced by the black-box method of Zheng et al. (2020) (footnote 11). For each sentence, the annotators are asked to decide which example is better with regard to fluency. For both evaluations, we adopt the majority vote for the final results.
To evaluate how successful the attack is, we report the parsing results of the target models on the original and the adversarially modified (after-attack) PTB test set. The results are reported in terms of unlabelled attachment score (UAS) and labelled attachment score (LAS). We also report the attack success rate, namely the percentage of successfully attacked sentences. If the prediction accuracy of the modified sentence is lower than the original one, it is regarded as a successful attack (footnote 12).
Footnote 10: The three human annotators are postgraduate students with a few years of research experience in syntactic parsing.
Footnote 11: We thank Zheng et al. (2020) for kindly providing us with the adversarial examples they generated.
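The attack metrics above can be computed as in the following sketch. It assumes parses are given as (head, label) pairs per token and that a sentence for which no adversarial example was found is marked with `None`; both conventions, and the function names, are choices made for this example. Punctuation handling, which PTB evaluations often treat specially, is not modelled.

```python
from typing import List, Optional, Sequence, Tuple

Parse = List[Tuple[int, str]]  # one (head_index, label) pair per token


def corpus_scores(preds: Sequence[Parse], golds: Sequence[Parse]) -> Tuple[float, float]:
    """Token-level UAS and LAS over a corpus, as percentages."""
    total = correct_arcs = correct_labelled = 0
    for pred, gold in zip(preds, golds):
        for (p_head, p_label), (g_head, g_label) in zip(pred, gold):
            total += 1
            if p_head == g_head:
                correct_arcs += 1
                if p_label == g_label:
                    correct_labelled += 1
    return 100.0 * correct_arcs / total, 100.0 * correct_labelled / total


def attack_success_rate(
    original_acc: Sequence[float], adversarial_acc: Sequence[Optional[float]]
) -> float:
    """Percentage of sentences whose accuracy drops after the attack.

    `adversarial_acc[k]` is None when no adversarial example was produced
    for sentence k (an assumed convention), which counts as a failed attack.
    """
    successes = sum(
        adv is not None and adv < orig
        for orig, adv in zip(original_acc, adversarial_acc)
    )
    return 100.0 * successes / len(original_acc)
```

The per-sentence accuracies passed to `attack_success_rate` would be UAS when comparing with previous work and LAS in the PTB-SD-3.3.0 experiments, following the measurement choice described for successful attacks.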

Comparison with Previous Work
We first evaluate our attacking method on PTB-SD-3.3.0-COP and compare it with previous work (Zheng et al., 2020). Since we focus on the black-box attack in this paper, we compare with their sentence-level black-box attack against the deep biaffine parser with only word-based embeddings as input. In both their and our settings, 15% of words are allowed to be modified.
Table 2 shows that adversarial examples generated by our method substantially outperform the previous method with regard to fluency and syntactic structure preservation. In the automatic evaluation, the average perplexity of examples generated by our method is 139.99, as compared to 267.96 for those generated by the previous work. For comparison, the average perplexity of the original PTB test set is 127.67, which is very close to ours.
In the human evaluation, results show that for 80% of the sentences, our adversarial examples have better fluency, which further confirms the effectiveness of our method. In addition, 85% of the examples we generated preserve the original syntactic structure, as compared to 75% reported by Zheng et al. (2020), showing that our method also improves the syntactic-structure preservation rate. Table 3 shows the attack results of the two methods (footnote 13). It is clear that with higher quality, the adversarial examples generated by our method cause [...]
Footnote 12: Note that Zheng et al. (2020) only considered unlabelled scores, so when comparing with these, we use the difference in UAS as the measurement of successful attacks. Conversely, in experiments on PTB-SD-3.3.0, we use the difference in LAS.
Footnote 13: We only compare UAS here since they did not report LAS in their paper.

[Table 3: attack results of the two methods; columns include Model and Orig-UAS.]

To further demonstrate the limitation of the BERT-based method, which the previous work used as the only candidate generator, we count the average number of candidates produced by each of our generators before and after filtering. Results in Table 4 show that although the BERT-based method generates the most candidates before filtering, only 1.89% of them are left after the filters are applied, whereas the percentage of remaining candidates varies from 5% to 10% for the other three generators. The results further verify that the quality of candidates generated by the BERT-based method is worse than that of candidates from the embedding-based, sememe-based and synonym-based methods.

To evaluate the ability of the filters, we conduct an ablation study with different combinations of these filters. Results in Table 5 show that the perplexity as well as the attack success rate decrease when more filters are applied. As expected, the greatest perplexity drop is brought by the perplexity filter.

Robustness Evaluation of Different Models
We evaluate the robustness of the different parsing models introduced in Section 3.1 on PTB-SD-3.3.0 and report the results in Table 6. First of all, when applied to unperturbed sentences, the graph-based deep biaffine parser performs consistently better than the transition-based stack-pointer parser (using the same embeddings). Among the four kinds of embeddings, the word piece-level embeddings (i.e., ELECTRA and RoBERTa) achieve the highest results, while GloVe yields the lowest results.
As for the adversarially modified sentences, we find that the drop in performance is similar for the two families of parsers (using the same embeddings), while the attack success rate against the stack-pointer parser is slightly higher. In terms of the embeddings, RoBERTa turns out to be the most robust: it has the lowest attack success rate and achieves the highest performance on the generated adversarial examples. ELMo is also a comparatively robust embedding. We are surprised to find that although ELECTRA achieves similar performance to RoBERTa on clean input data, it performs poorly on the adversarial examples. We hypothesise that this is due to ELECTRA's training objective, i.e. learning to predict whether a token in a corrupted sentence is genuine or not. With this objective, some of our substitutes can be predicted as incorrect tokens, yielding token representations in a region of the space not encountered by the parser in training, and hence damaging its performance. Lastly, GloVe is the most vulnerable embedding (footnote 14).

Out-of-Vocabulary and Out-of-Training Words
In this section, we investigate the roles out-of-vocabulary (OOV, words not in the embedding's vocabulary) and out-of-training (OOT, words not in the training set of the parser) words play in dependency parsing attacks. We perform attacks on the Biaffine GloVe models trained with (i) 50k vocabulary (50k), (ii) 400k vocabulary (400k) and (iii) the same 400k vocabulary but where all candidates not in the training set are filtered out (400k (T.)).
The results are shown in Table 7, where we report the attack results along with the number of OOV and OOT words among the adversarially modified words before and after the attack. Firstly, by comparing the OOV and OOT numbers before and after the attack in the 50k model, we find that words chosen to be replaced are often non-OOV and non-OOT, while their substitutes are often OOV and OOT. Secondly, the comparison between the 50k and 400k results shows that when the number of OOV words decreases, the robustness of the model increases. Therefore, it is reasonable to assume that OOV words in adversarial examples cause incorrect predictions. Thirdly, according to the 400k and 400k (T.) results, when the number of OOT words in adversarial examples is reduced to 0 by filtering out all the OOT candidates, the attack success rate drops substantially. Therefore, we have reason to believe that unfamiliar OOT words are another factor degrading a parser's performance.
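The OOV and OOT counts, and the candidate restriction used in the 400k (T.) setting, can be illustrated with the short sketch below. The vocabularies are plain Python sets and matching is by exact string; any case normalisation is left out because it is not specified here. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def count_oov_oot(
    words: Iterable[str], embedding_vocab: Set[str], training_vocab: Set[str]
) -> Dict[str, int]:
    """Count words missing from the embedding vocabulary (OOV) and
    from the parser's training set (OOT)."""
    words = list(words)
    return {
        "oov": sum(w not in embedding_vocab for w in words),
        "oot": sum(w not in training_vocab for w in words),
    }


def drop_oot_candidates(candidates: Iterable[str], training_vocab: Set[str]) -> List[str]:
    """Keep only candidates that occur in the training set."""
    return [c for c in candidates if c in training_vocab]
```

Running `count_oov_oot` once on the replaced words and once on their substitutes gives the before and after numbers described for Table 7.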
The OOV problem mostly appears in models using word-level embeddings such as GloVe and can be alleviated by simply increasing the vocabulary size. For the OOT problem, one potential solution is adversarial training, where a new parser is trained with a mixture of clean training data and adversarial examples.
Footnote 14: [...] model in Table 6 we attack another two trained with different random seeds. The experiment shows all the results are stable across seeds.

Adversarial Training
Previous work (Zheng et al., 2020; Han et al., 2020) used a limited number (from 2,000 sentences to half of the training data) of adversarial examples in adversarial training, as Zheng et al. (2020) argued that overuse of them may lead to a performance drop on the clean data. In this section, we investigate adversarial training strategies on all the parsing models introduced in Section 3.1. Specifically, we generate adversarial examples for the whole PTB training set and retrain parsers on different amounts of adversarial examples along with the original training set. Figure 1 shows that as the number of adversarial examples used in adversarial training increases, the robustness of the models increases accordingly. For most of the models, the increase in robustness stops when between 50% and 70% of the adversarial examples are used.

Tables 9 and 10 show that the attack success rate always drops when adversarial examples are tested on other models, indicating that the adversarial examples strongly depend on the parser model, the token embeddings and even the specific instantiation of the model (i.e., the random seed).

Based on the observations from Section 4.5, we propose to improve the robustness of parsing models using a cross-seed ensemble and a cross-embedding ensemble. To ensemble multiple parsers, we simply compute the average of the probability distributions across them and use that result as the new distribution in the ensembled model. Figure 2 shows the effect of the cross-seed ensemble, where almost all the attack success rates drop with such an ensemble. In addition, it is most effective with ELMo and least effective with ELECTRA and RoBERTa. Table 11 shows the effect of using the cross-embedding ensemble, where robustness increases when more models with different token embeddings are ensembled. Moreover, contrary to adversarial training, the ensemble method is not tuned to specific types of attacks and appears robust to 'unseen' attacks, showing that it is more likely to defend against new attacks.
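The averaging step of the ensemble can be sketched as follows for head-selection distributions. This is an illustration under assumptions: each model is assumed to expose, for every token, a probability distribution over candidate heads, and the final argmax decoding is a simplification that does not enforce a well-formed tree (a graph-based parser would normally run a tree decoding algorithm over the averaged scores). The function names are hypothetical.

```python
from typing import List, Sequence

# probs[i][h] = probability that token i attaches to head h
HeadProbs = List[List[float]]


def ensemble_head_probs(per_model_probs: Sequence[HeadProbs]) -> HeadProbs:
    """Average the head distributions of several parsers, token by token."""
    n_models = len(per_model_probs)
    averaged: HeadProbs = []
    for token_rows in zip(*per_model_probs):
        n_heads = len(token_rows[0])
        averaged.append(
            [sum(row[h] for row in token_rows) / n_models for h in range(n_heads)]
        )
    return averaged


def greedy_heads(probs: HeadProbs) -> List[int]:
    """Pick the most probable head per token (no tree constraint enforced)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]
```

The same averaging applies whether the ensembled parsers differ in random seed or in token embedding, since only their output distributions are combined.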

Related Work
Existing textual adversarial attacks have mostly focused on semantic tasks such as sentiment analysis (Zhang et al., 2019) and textual entailment (Jin et al., 2020). Although most of this work has applied various techniques to maintain the fluency of adversarial examples, a recent study by Morris et al. (2020) reported that quite a number of these techniques introduce grammatical errors. In syntactic tasks, Zheng et al. (2020) recently proposed the first dependency parser attacking method which depends entirely on BERT to generate candidates. However, we show that the quality of adversarial examples generated by their method is relatively low due to the limitation of the BERT-based generator, and we propose to generate better examples by using more generators and stricter filters. Han et al. (2020) proposed an approach to attack structured prediction models with a seq2seq model (Wang et al., 2016) and evaluated this model on dependency parsing. They used two reference parsers in addition to the victim parser to supervise the training of the adversarial example generator, and found that the three parsers produce better results when they have different inductive biases embedded to make the attack successful. This finding is quite close in spirit to our conclusion in Section 4.5. Hu et al. (2020) also put forth efforts to modify the text in syntactic tasks while preserving the original syntactic structure. However, their goal is to preserve privacy via the modification of words that could disclose sensitive information.

Conclusion
In this paper, we propose a method for generating high-quality adversarial examples for dependency parsing and show its effectiveness based on automatic and human evaluation. We investigate the robustness of different types of neural dependency parsers. We show that OOV and OOT words are two critical factors that cause a performance drop and propose to solve the OOT problem with adversarial training. We further examine three kinds of transferability of adversarial examples and propose to improve the robustness of parsing models by ensembling across random seeds and token embeddings.