SeqAttack: On Adversarial Attacks for Named Entity Recognition

Named Entity Recognition is a fundamental task in information extraction and an essential element of various Natural Language Processing pipelines. Adversarial attacks have been shown to greatly affect the performance of text classification systems, but knowledge about their effectiveness against named entity recognition models is limited. This paper investigates the effectiveness and portability of adversarial attacks from text classification to named entity recognition, as well as the ability of adversarial training to counteract these attacks. We find that character-level and word-level attacks are the most effective, but that adversarial training can grant significant protection at little to no expense of standard performance. Alongside our results, we also release SeqAttack, a framework to conduct adversarial attacks against token classification models (used in this work for named entity recognition), and a companion web application to inspect and cherry-pick adversarial examples.


Introduction
Named Entity Recognition (NER) is the task of recognizing named entities in a chunk of text. Named entities are words (one or more) belonging to a particular semantic category, such as location, person or organization. NER is used both as a standalone tool and as an essential component in several Natural Language Processing (NLP) pipelines, such as Information Retrieval (Petkova and Croft, 2007) and Machine Translation (Babych and Hartley, 2003). Traditionally, NER has been attempted with rule-based approaches, Hidden Markov Models and Conditional Random Fields (Li et al., 2020a). In recent years, deep learning has outperformed these methods (Li et al., 2017) (Liu et al., 2019a), especially with the introduction of general-purpose language models such as BERT (Devlin et al., 2019).
Neural networks are vulnerable to adversarial attacks, which can be defined as processes that craft incorrectly-predicted samples from correctly-predicted inputs by applying small perturbations; an example can be seen in Figure 1. This shows that deep learning models are fragile and might not be ready for deployment in critical scenarios. The most popular technique to overcome this issue is adversarial training, which uses adversarial attacks to craft additional training samples and retrains the model from scratch (Li et al., 2020b) (Li et al., 2021). Adversarial attacks and training have been explored extensively for text classification, but current research on NER has only considered attacks based on adversarial typos (Araujo et al., 2020), and the effectiveness of more complex attacks (at the word and sentence level) is unknown. Word-level attacks are particularly important because they generate adversarial examples that are highly likely to appear in the real world, providing valuable additional training data (an example can be seen in Figure 1). This paper tackles this problem by investigating the following research question:
• RQ1: How robust are named entity recognition models against adversarial attacks at the character, word and sentence level?
In particular, this paper focuses on a BERT base cased model trained on CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) in order to maintain consistency across the paper.
Figure 1: Word-level adversarial example for NER from CoNLL2003 (Tjong Kim Sang and De Meulder, 2003). Changing "standings" to "ranking" induces an incorrect classification of "Super G" as a non-entity.
Several attack strategies are available to fool text classification models. In this paper, we follow the taxonomy by (Yuan et al., 2019), focusing on the properties listed below, with the addition of granularity (Zhang et al., 2020):
• Model knowledge: if all the model information is known, attacks are defined as white box. Black box attacks instead have access only to the confidence scores. This paper focuses on black box attacks.
• Specificity: attacks which aim to change the model's prediction to a specific class are called targeted, whereas untargeted attacks consider any incorrect prediction valid.
Some popular attack strategies organized by granularity are presented below.

Attack strategies
At the character level, DeepWordBug (Gao et al., 2018) generates candidate adversarial examples at each step by swapping adjacent characters, substituting a character with a random one, or deleting or inserting a character. At the word level, TextFooler (Jin et al., 2020) ranks the words in a sample by prediction relevance and replaces the most important ones using a word embedding optimized for synonyms (Mrkšić et al., 2016). BERT-Attack (Li et al., 2020b) and CLARE (Li et al., 2021) operate similarly, but they respectively use BERT and DistilRoBERTa (Sanh et al., 2019) (Liu et al., 2019b) as language models to suggest potential candidates. CLARE supports token replacements, insertions and merges, whereas BERT-Attack and TextFooler only support token replacements. All word-level attacks enforce a semantic similarity constraint using the Universal Sentence Encoder (Cer et al., 2018). Finally, at the sentence level, SCPN (Iyyer et al., 2018) generates paraphrases that match one of its built-in syntactic forms.
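To make the character-level transformations concrete, the sketch below enumerates DeepWordBug-style candidate perturbations (adjacent swap, random substitution, deletion, insertion) for a single word. This is a minimal illustration; the function name and structure are ours, not taken from the cited implementations.

```python
import random
import string

def character_perturbations(word: str) -> list[str]:
    """Generate DeepWordBug-style candidate perturbations of a single word:
    adjacent swap, random substitution, deletion and insertion."""
    candidates = []
    for i in range(len(word)):
        random_char = random.choice(string.ascii_lowercase)
        # Swap the character with the next one
        if i < len(word) - 1:
            candidates.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
        # Substitute the character with a random one
        candidates.append(word[:i] + random_char + word[i + 1:])
        # Delete the character
        candidates.append(word[:i] + word[i + 1:])
        # Insert a random character before the current position
        candidates.append(word[:i] + random_char + word[i:])
    return candidates

print(character_perturbations("standings")[:8])
```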
In comparison to text classification, to the authors' knowledge, adversarial attacks (and training) for NER appear in only two works, (Araujo et al., 2020) and (Wang et al., 2020). The former tackles biomedical NER, showing that BERT-based models are susceptible to character swaps, keyboard typo noise and synonym-based entity-word substitutions. The latter integrates adversarial training into the training loop of an LSTM-CNN: at each training step, adversarial examples are obtained by perturbing the word embeddings directly. This paper contributes by evaluating a larger number of attack strategies and the portability of adversarial attacks for text classification to token classification problems. Moreover, we provide new insights and a comparison of the samples generated by the different attack strategies.

Adversarial training
Adversarial training aims to improve a model's robustness using adversarial examples. This can be achieved mainly in two ways: via data augmentation or by integrating adversarial training within the model's training loop.
The first method attacks the victim model using the training set as the attack input and, once enough samples have been obtained, retrains the model from scratch. One of the first works to use this technique is (Alzantot et al., 2018), in which the authors adversarially train a sentiment classification model on the IMDB dataset without success. Later works, such as (Li et al., 2020b) and (Li et al., 2021), show more promising results: the former uses adversarial training to make a natural language inference model more robust, gaining 15% after-attack accuracy at the expense of a minimal test accuracy loss. The latter adversarially trains BERT and TextCNN models on the AG news dataset, obtaining similar improvements: without loss of test accuracy, the authors manage to reduce the attack rate by 12.3% and 3.5% for BERT and TextCNN respectively. The second method is used by (Wang et al., 2020), where adversarial training is integrated into the training loop using a loss function that takes adversarial perturbations into account. Using this technique, the authors improve the model's generalizability by reducing overfitting.

The SeqAttack framework
The most popular frameworks for conducting adversarial attacks are TextAttack (Morris et al., 2020) and OpenAttack (Zeng et al., 2021), but they do not support token classification problems such as named entity recognition, in which each token is classified as being the beginning of (B), inside (I) or outside (O) an entity according to the inside-outside-beginning (IOB) scheme (Ramshaw and Marcus, 1995). In order to attack NER models we developed SeqAttack, a framework for conducting adversarial attacks against token classification models. The framework extends TextAttack and inherits its design, where attacks are composed of a goal function (the objective to optimize), transformations (how the input text is perturbed), constraints which limit the candidate perturbations, and a search method. The framework can be used by NLP practitioners to attack models, for data augmentation and to quickly prototype attack strategies. Inheriting the structure of TextAttack also means that its attack strategies can be easily ported and used against NER models.

In TextAttack, every attack optimizes a goal function, which in the case of text classification is defined as $1 - p_{\hat{y}}$, where $\hat{y}$ is the ground-truth class and $p_{\hat{y}}$ is the normalized confidence score assigned to it. In SeqAttack, in order to support NER, the goal function is reformulated as follows:

$$\mathrm{score}(x) = \frac{\sum_{i=1}^{N} \mathrm{goal}(y_i, \hat{y}_i)}{\mathrm{countEntities}(x)}$$

where y is the model prediction, ŷ the ground truth, N the number of tokens in the sample and x the attacked sample; countEntities(x) returns the number of entity tokens in a sample. goal(y_i, ŷ_i) considers valid any incorrect classification of an entity token, i.e. it equals 1 when ŷ_i is an entity token and y_i ≠ ŷ_i, and 0 otherwise. We call this function the untargeted NER goal function. It is important to note that this function assigns no score to newly introduced entities, since the CoNLL2003 metrics consider only the classification of ground-truth named entities. We also define the untargeted-strict NER goal function, which assigns no score to flips between I-CLS and B-CLS. Figure 3 highlights the difference between the two goal functions.
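A minimal sketch of the two goal functions is shown below, assuming that predictions and ground truth are aligned lists of IOB labels. The function names and structure are ours; the actual SeqAttack implementation may differ in its details.

```python
def untargeted_ner_goal(predictions: list[str], ground_truth: list[str]) -> float:
    """Sketch of the untargeted NER goal function: the fraction of ground-truth
    entity tokens whose predicted label differs from the ground truth.
    Tokens labelled 'O' in the ground truth are ignored, so newly introduced
    entities do not contribute to the score."""
    entity_indices = [i for i, label in enumerate(ground_truth) if label != "O"]
    if not entity_indices:
        return 0.0
    flipped = sum(1 for i in entity_indices if predictions[i] != ground_truth[i])
    return flipped / len(entity_indices)


def untargeted_strict_ner_goal(predictions: list[str], ground_truth: list[str]) -> float:
    """Variant that assigns no score to flips between B-CLS and I-CLS of the
    same entity class (e.g. B-LOC -> I-LOC): only the class suffix matters."""
    entity_indices = [i for i, label in enumerate(ground_truth) if label != "O"]
    if not entity_indices:
        return 0.0
    flipped = sum(
        1
        for i in entity_indices
        if predictions[i].split("-")[-1] != ground_truth[i].split("-")[-1]
    )
    return flipped / len(entity_indices)
```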

Adversarial attacks
This paper employs attack strategies implemented in TextAttack that have proved successful for text classification, applying them to NER models with minor adaptations. In particular, the following modifications were applied:

DeepWordBug
We use two different versions of this attack strategy: DeepWordBug-I, true to the original implementation, and DeepWordBug-II, which is not allowed to modify named entities. Both attacks have a Levenshtein distance constraint, whose maximum allowed distance is specified with a subscript, as in DeepWordBug-I5.
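The Levenshtein distance constraint can be enforced with a standard dynamic-programming edit distance; the sketch below (our own helper, not the TextAttack built-in) rejects candidates that exceed the configured maximum distance of 5 or 30.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def within_edit_budget(original: str, perturbed: str, max_distance: int = 5) -> bool:
    """Accept a candidate only if it stays within the allowed Levenshtein
    distance from the original text (5 for DeepWordBug-I5 / DeepWordBug-II5)."""
    return levenshtein(original, perturbed) <= max_distance
```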

BERT-Attack
The sentence similarity constraint was set to 0.4 and the replacement of numeric tokens with alphanumeric ones was forbidden (e.g. "4" cannot be replaced by "car"). Only non-entity tokens are allowed to be replaced (to avoid the generation of trivial examples, e.g. swapping a location with a person's name) and candidate replacements which are named entities are also rejected (e.g. the candidate replacement "Amsterdam" will be rejected). The attack can perturb up to 40% of the words in a sample.
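These filtering rules can be summarized as a simple candidate check. The sketch below is illustrative, with `known_entities` as a hypothetical lookup of entity surface forms rather than a SeqAttack API.

```python
def is_valid_replacement(original_token: str,
                         candidate: str,
                         original_label: str,
                         known_entities: set[str]) -> bool:
    """Sketch of the candidate filtering rules applied to BERT-Attack:
    - tokens that are part of a named entity may not be replaced at all,
    - numeric tokens may not be replaced with non-numeric ones,
    - candidates that are themselves named entities are rejected.
    `known_entities` is a hypothetical set of entity surface forms."""
    if original_label != "O":
        return False  # never touch entity tokens
    if original_token.isnumeric() and not candidate.isnumeric():
        return False  # "4" cannot become "car"
    if candidate in known_entities:
        return False  # e.g. reject "Amsterdam" as a replacement
    return True
```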

CLARE
The implementation of CLARE used in this paper only supports replacements and insertions. Similarly to BERT-Attack, the replacement of entity tokens is forbidden and candidate replacements which are named entities are rejected. When a new token is inserted, it is automatically labelled as being outside an entity (O). If a token insertion splits a named entity the beginning/inside labels will be adjusted accordingly.
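A sketch of the label bookkeeping for insertions is shown below, assuming tokens and IOB labels are kept as aligned lists; the names and the worked example at the end are illustrative, not taken from the CLARE implementation.

```python
def insert_token_with_labels(tokens: list[str], labels: list[str],
                             position: int, new_token: str):
    """Sketch of the label bookkeeping after a CLARE insertion: the new token
    is labelled 'O', and if the insertion splits a named entity the token that
    now starts the second half is relabelled from I-CLS to B-CLS."""
    tokens = tokens[:position] + [new_token] + tokens[position:]
    labels = labels[:position] + ["O"] + labels[position:]
    # If the token right after the insertion continues an entity, it now
    # starts a new span and must become a beginning (B-) tag.
    after = position + 1
    if after < len(labels) and labels[after].startswith("I-"):
        labels[after] = "B-" + labels[after][2:]
    return tokens, labels


tokens, labels = insert_token_with_labels(
    ["Alpine", "Skiing", "World", "Cup"],
    ["B-MISC", "I-MISC", "I-MISC", "I-MISC"],
    position=2, new_token="overall")
# labels become ['B-MISC', 'I-MISC', 'O', 'B-MISC', 'I-MISC']
```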

SCPN
Using the OpenAttack implementation, the algorithm iteratively generates candidate paraphrases, using the original sample or a paraphrase as the starting point. The candidates are processed to remove identical consecutive unigrams and bigrams, and only the candidates which preserve at least one named entity are kept. Every token which is not an entity in the original sample is labelled as being outside an entity (O). An example can be seen in Figure 2.
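The candidate post-processing can be sketched as two small helpers: one removing identical consecutive unigrams and bigrams, and one checking that at least one original entity survives the paraphrase. This is an illustration under those assumptions, not the exact implementation.

```python
def remove_consecutive_repeats(tokens: list[str]) -> list[str]:
    """Drop identical consecutive unigrams and bigrams from a paraphrase,
    e.g. ['in', 'the', 'in', 'the', 'race'] -> ['in', 'the', 'race']."""
    cleaned: list[str] = []
    for token in tokens:
        if cleaned and cleaned[-1] == token:
            continue                                   # repeated unigram
        if len(cleaned) >= 3 and [cleaned[-1], token] == cleaned[-3:-1]:
            cleaned.pop()                              # repeated bigram
            continue
        cleaned.append(token)
    return cleaned


def keeps_an_entity(paraphrase: str, original_entities: list[str]) -> bool:
    """Keep only candidates that preserve at least one original named entity."""
    return any(entity in paraphrase for entity in original_entities)
```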

Adversarial training
This paper approaches adversarial training using the training dataset augmentation strategy: we attack the model using its training set as the input, generating at most one adversarial example per training sample, and we retrain the model with the augmented dataset. DeepWordBug-I5 and BERT-Attack were chosen as the attack strategies in order to investigate the respective effects of character-level and word-level adversarial training.
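A sketch of this augmentation strategy is shown below, where `attack` is a hypothetical callable that returns an adversarial example (or None on failure) for a given training sample; the function and parameter names are ours.

```python
def augment_training_set(train_samples, attack, max_examples=2000):
    """Sketch of adversarial data augmentation: attack the victim model on its
    own training set, keep at most one adversarial example per sample, and
    return the augmented dataset used to retrain the model from scratch.
    `attack` is a hypothetical callable returning a perturbed sample (tokens
    plus labels) or None when the attack fails."""
    adversarial_examples = []
    for sample in train_samples:
        if len(adversarial_examples) >= max_examples:
            break
        adversarial = attack(sample)
        if adversarial is not None:
            adversarial_examples.append(adversarial)
    return list(train_samples) + adversarial_examples
```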

Adversarial attacks
The attack techniques in Section 3 were evaluated on a BERT base cased model (Devlin et al., 2019), fine-tuned on the CoNLL2003 dataset for three epochs using the transformers library (Wolf et al., 2020). All attacks use the untargeted-strict goal function and target a subset of 256 samples from the test set, selected such that the model incorrectly predicts at most 10% of the entities contained in each sample. For each sample, the attack is allowed up to 120 seconds and a maximum of 512 model invocations (queries).
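The selection of the evaluation subset can be sketched as follows, assuming each sample carries gold IOB labels and `predict` is a hypothetical callable returning the model's per-token predictions; names and data layout are ours.

```python
def select_evaluation_samples(samples, predict, max_error_rate=0.10, limit=256):
    """Sketch of the evaluation subset selection: keep samples for which the
    model misclassifies at most 10% of the ground-truth entity tokens,
    stopping once `limit` samples have been collected."""
    selected = []
    for sample in samples:
        predictions = predict(sample)
        entity_indices = [i for i, label in enumerate(sample["labels"])
                          if label != "O"]
        if not entity_indices:
            continue
        errors = sum(1 for i in entity_indices
                     if predictions[i] != sample["labels"][i])
        if errors / len(entity_indices) <= max_error_rate:
            selected.append(sample)
        if len(selected) == limit:
            break
    return selected
```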

Evaluation metrics
The attacks are evaluated following previous work (Li et al., 2021) (Morris et al., 2020), which employs the following automated metrics (in addition to accuracy, recall and F1 score, as in the CoNLL2003 task):
• Attack Rate (A-Rate): percentage of adversarial examples that can fool the model. An adversarial example is considered successful when at least one entity is incorrectly classified.
• Modification Rate (Mod): percentage of modified tokens. Insert operations increase the modified-token count by one (Li et al., 2021). A sketch of this metric and the attack rate is given after the list.
• ∆ Grammar Errors (∆GErr): difference in the number of grammar errors between the adversarial example and its original counterpart, calculated with LanguageTool (Naber et al., 2003).
• Textual similarity (Sim): cosine similarity between the adversarial example and the original input calculated via the Universal Sentence Encoder.
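For concreteness, a minimal sketch of the first two metrics is given below, with our own helper names and under the simplifying assumption of replacement-only perturbations, so that the original and adversarial token sequences stay aligned.

```python
def is_successful(predictions: list[str], labels: list[str]) -> bool:
    """An adversarial example is successful if at least one ground-truth
    entity token is misclassified after the attack."""
    return any(p != g for p, g in zip(predictions, labels) if g != "O")


def attack_rate(per_sample_success: list[bool]) -> float:
    """Percentage of attacked samples that fool the model."""
    return 100.0 * sum(per_sample_success) / len(per_sample_success)


def modification_rate(original: list[str], adversarial: list[str]) -> float:
    """Percentage of modified tokens, assuming replacement-only perturbations
    so that the two token sequences are position-aligned."""
    changed = sum(1 for o, a in zip(original, adversarial) if o != a)
    return 100.0 * changed / len(original)
```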
We also define the Labels Score (L-Score) metric as the percentage of incorrectly classified entities in a sample. All the metrics defined above, with the exception of the attack rate, are averaged over the successful adversarial examples. Table 1 lists the metrics for the original and attacked datasets.

The following paragraphs describe the adversarial training setup and highlight the differences between the two adversarial training strategies. The victim model, NER_small, achieves a reasonable performance on the test set (Table 4, last row) but can be fooled by both DeepWordBug-I5 (Table 2) and BERT-Attack (Table 3). The adversarial data augmentation was done by attacking NER_small using its own training set as the attack input. We generated 2000 and 1000 adversarial examples for DeepWordBug-I5 and BERT-Attack respectively, which were then used to train robust models, whose performance on CoNLL2003 is listed in Tables 4 and 5.

Model evaluation
To evaluate the effectiveness of adversarial training, we ran the same attacks against NER_small and its robust counterparts, using the same CoNLL2003 subset used for evaluating attack strategies. Both attack strategies were allowed up to 512 model invocations. DeepWordBug-I5 and BERT-Attack were respectively allowed up to 45 and 60 seconds to attack each sample.
Results and discussion

Adversarial attacks
Table 1 lists the after-attack metrics for the various attack strategies. By observing the metrics we can notice that DeepWordBug-I5 is the most effective. Its success is most likely due to the fact that it can modify named entities: when named entities are preserved, as in DeepWordBug-II5, the attack rate drops to 27%, and increasing the Levenshtein distance constraint to 30 has little effect. Word-level attacks are less effective than unconstrained character-level attacks, but perform better than similarly constrained character-level attacks, decreasing the model's accuracy by up to 26% in the case of BERT-Attack. Even though less effective, word-level attack strategies may be useful for adversarial training since the generated samples are highly grammatical (introducing less than 0.5 grammar errors per sample), have a low percentage of modified words (except when insertions are used) and maintain a high sentence similarity: 84-86% for BERT-Attack and CLARE versus 77% for DeepWordBug-I5. Some adversarial examples generated by BERT-Attack and CLARE respectively can be seen in the appendix (Figures 8 and 9). Future work may attempt to apply word-level attacks to the entities themselves as well, making sure to preserve the entity class. This would both speed up the generation of adversarial examples (due to the higher sensitivity) and uncover examples highly likely to appear in the real world.
Table 3: Comparison of CoNLL2003 models against BERT-Attack, trained using a different amount of adversarial examples generated by BERT-Attack (specified next to the model) and the untargeted goal function. The attack had up to 60 seconds to successfully attack an input sample. ↑ (↓) indicates that higher (lower) is better from the attack perspective.

Adversarial training
Tables 2 and 3 respectively summarize the attack metrics for DeepWordBug-I5 and BERT-Attack. In line with the adversarial attack results, DeepWordBug-I5 is considerably more successful than BERT-Attack, reducing the model's after-attack accuracy to 52%, whereas BERT-Attack only manages to reduce the accuracy to 88%. Adversarial training grants significant protection from both attacks: in the case of DeepWordBug-I5 (Table 2), adding only 500 samples to the training set already increases the after-attack accuracy by 21% without affecting the test set metrics, while causing an increase in the modification rate and a decrease in the similarity score. The improvement is statistically significant: paired t-tests on the modification rate and the labels score respectively yield p-values of 0.0086 and 2.42e-09, confirming the added robustness of the adversarially trained model. The improvement is also visible in Figure 4, where the labels score distribution of the attacked dataset for the normal model is more skewed towards the right than that of its robust counterpart, showing a smaller attack success on individual samples for the robust model. Similarly, the modification rate distribution for the normal model is more skewed towards the left, indicating that more words need to be perturbed to fool the robust model. Adding more samples further improves the after-attack scores at a small cost to the standard metrics (Table 4), but the improvement over the robust model trained with 500 samples is statistically significant only when 2000 adversarial examples are used, and only with regard to the labels score (p = 0.017).
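Such significance tests can be reproduced with a standard paired t-test over per-sample metrics, for example with SciPy as sketched below; the arrays shown are illustrative placeholders, not our experimental data.

```python
from scipy import stats

# Hypothetical per-sample modification rates for the same attacked inputs,
# measured against the baseline model and its adversarially trained counterpart.
baseline_mod_rates = [0.12, 0.08, 0.15, 0.10, 0.21]
robust_mod_rates = [0.18, 0.14, 0.19, 0.16, 0.25]

# Paired t-test: does attacking the robust model require significantly
# more perturbation than attacking the baseline on the same samples?
t_statistic, p_value = stats.ttest_rel(robust_mod_rates, baseline_mod_rates)
print(f"t = {t_statistic:.3f}, p = {p_value:.4f}")
```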
Similarly, the robust models trained with BERT-Attack perform comparably to NER_small on the test set, even improving the model's F1 score by 1% (Table 5). Using only 500 samples, the after-attack accuracy increases by 6% and the attack rate drops by 4%. Adding more samples further reduces the attack rate (Table 3). Using only 500 samples causes a significant improvement in the modification rate needed to break the model, yielding a p-value of 2.66e-05, but does not grant a significant improvement in the labels score (p = 0.4). The latter improves significantly only when 1000 samples are used, where the p-value for the labels score is 0.011. These results are very encouraging, since the added robustness does not affect the test set metrics and even improves them, suggesting that this attack method could be used for data augmentation in low-resource scenarios, a potential direction for future research. The difference in the number of samples needed to grant significant robustness against DeepWordBug-I5 and BERT-Attack may be explained by the initial effectiveness of the attack strategy: the former reduces the baseline accuracy to 52%, whereas the latter only reduces it to 88%.
Figure 4: KDE plots of the labels score and modification rate distributions for NER_small and its robust counterpart, when attacked with DeepWordBug-I5. To be successful, the attack needs to alter more words for the robust model, and nonetheless achieves a lower labels score on average.

Conclusion
In this paper we showed that NER models are vulnerable to adversarial attacks at the character, word and sentence level. When allowed to alter named entities, DeepWordBug is the most effective, but it produces highly ungrammatical samples (appendix, Figures 6 and 7). Thus, character-level attacks are not recommended for adversarial training or data augmentation, since the produced samples are unlikely to appear in a real-world setting. Word-level attacks instead produce more fluent adversarial examples (appendix, Figures 8 and 9), which can be used both for adversarial training and for data augmentation. Finally, with regard to sentence-level attacks, this paper finds that they often produce low-quality samples for this particular dataset (appendix, Figure 5). This is probably due to the fact that SCPN paraphrases are generated following a small set of target syntactic forms, which are incompatible with CoNLL2003. Further research in this direction is recommended, as paraphrasing methods produce a richer variety of samples and may reveal weaknesses in a model that cannot be discovered by character-level or word-level attacks.
To help NLP practitioners evaluate and improve their models' robustness and to foster research on adversarial attacks in token classification (and named entity recognition) we release SeqAttack 1 , a Python library for conducting adversarial attacks against token classification models. The library is accompanied by a web application 2 to inspect the generated adversarial examples and cherry pick them for adversarial training.