RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models

To audit the robustness of named entity recognition (NER) models, we propose RockNER, a simple yet effective method to create natural adversarial examples. Specifically, at the entity level, we replace target entities with other entities of the same semantic class in Wikidata; at the context level, we use pre-trained language models (e.g., BERT) to generate word substitutions. Together, the two levels of attack produce natural adversarial examples whose distribution is shifted from the training data on which our target models have been trained. We apply the proposed method to the OntoNotes dataset and create a new benchmark named OntoRock for evaluating the robustness of existing NER models via a systematic evaluation protocol. Our experiments and analysis reveal that even the best model suffers a significant performance drop, and these models seem to memorize in-domain entity patterns instead of reasoning from the context. Our work also studies the effects of a few simple data augmentation methods on improving the robustness of NER models.


Introduction
Recent named entity recognition (NER) models have achieved great performance on many conventional benchmarks such as CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) and OntoNotes 5.0 (Weischedel et al., 2013). However, it is not clear whether they are reliable in realistic applications in which entities and/or context words fall outside the distribution of the training data. It is thus important to audit the robustness of NER systems via natural adversarial attacks. [1] Most existing methods for generating adversarial attacks in NLP focus on sentence classification (Jin et al., 2020; Li et al., 2020; Minervini and Riedel, 2018) and question answering (Jia and Liang, 2017; Ribeiro et al., 2018; Gan and Ng, 2019), and these methods lack special designs reflecting the underlying compositions of NER examples, i.e., entity structures and their context words. In this paper, we focus on creating general natural adversarial examples (i.e., real-world entities and human-readable context) for evaluating the robustness of NER models.

Figure 1: An original NER example and its natural adversarial counterparts.
Original: I thank my Beijing [GPE] friends and wish everyone a Happy New Year [EVENT].
Entity-only attack: I thank my Bari [GPE] friends and wish everyone a Happy Casimir Pulaski Day [EVENT].
Entity + context attack: I admire my Bari [GPE] roommates and wish everyone a Happy Casimir Pulaski Day [EVENT].

[1] Our code and data are publicly available at the project website: https://inklab.usc.edu/rockner.

As shown in Figure 1, given a NER example, our method first generates entity-level attacks by replacing the original entities with entities from Wikidata, and then uses a pre-trained masked language model such as BERT to generate context-level attacks. We choose the OntoNotes dataset (Weischedel et al., 2013) to showcase RockNER because of the dataset's high annotation quality and wide coverage of entity types. Thus, we create a novel benchmark, named OntoRock, for evaluating the robustness of a wide range of modern NER models.
We analyze the robustness of popular existing NER models on the OntoRock benchmark in order to answer three research questions: (Q1) How robust are current NER models? (Q2) Where are NER models brittle? (Q3) Can we improve the robustness of NER models via data augmentation? Our experiments and analysis provide three main findings: 1) even the best model is brittle to our natural adversarial examples, suffering a significant performance drop (92.4% → 58.5% in F1); 2) current NER models tend to memorize entity patterns instead of reasoning based on the context, and there are specific patterns in entity-typing mistakes; 3) simple data augmentation methods can indeed improve robustness to some extent. We believe the proposed RockNER method, the OntoRock benchmark, and our analysis will benefit future research on improving the robustness of NER models.

Natural Adversarial Attacks for NER
We present RockNER, a simple yet effective method to generate high-quality natural adversarial examples for evaluating the robustness of NER models by perturbing both the entities and contexts of original examples. We apply the method to the development set and test set of OntoNotes to create the OntoRock benchmark.

Entity-Level Attacks
To generate relevant entities for modifying existing NER data, we collect a dictionary of natural adversarial entities of different fine-grained classes via Wikidata. As shown in Figure 2, our three-stage pipeline works as follows:
• (1) Entity Linking: We first use BLINK (Wu et al., 2020) to link each entity in the original examples from its surface form to a canonical entry in Wikidata with a unique identifier (QID), e.g., "Beijing" → Q956.
• (2) Fine-grained Classification: Then, we execute a query to get its fine-grained class via the InstanceOf relation (P31), e.g., Q956 →(P31) Q1549591 ("big city").
• (3) Dictionary Expansion: Finally, we retrieve additional Wikidata entities within each individual entity class. Given a particular entity such as "Beijing", we collect additional out-of-distribution entities, such as "Bari". Both are big cities (GPE type), while the latter is much less correlated with the context in the training data.
To ensure quality, we manually curate the fine-grained entity classes and remove incorrectly linked entities. We use a different approach for collecting PERSON attack entities, because adversarial names can be more efficiently created as random combinations of first, middle, and last names collected from the Wikidata person-name list.
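Steps (2) and (3) above amount to two lookups against the Wikidata SPARQL endpoint. The query templates below are a minimal sketch of this idea; they are our illustrative assumption, not the authors' exact pipeline code.

```python
def class_of_query(qid: str) -> str:
    """SPARQL: fetch the fine-grained class(es) of an entity via InstanceOf (P31)."""
    return f"SELECT ?class WHERE {{ wd:{qid} wdt:P31 ?class . }}"

def class_members_query(class_qid: str, limit: int = 500) -> str:
    """SPARQL: fetch other entities of the same fine-grained class (step 3)."""
    return (
        "SELECT ?entity ?entityLabel WHERE { "
        f"?entity wdt:P31 wd:{class_qid} . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } '
        f"}} LIMIT {limit}"
    )

# e.g., "Beijing" is linked to Q956; Q956 has P31 class Q1549591 ("big city"),
# and querying the members of Q1549591 yields siblings such as "Bari".
print(class_of_query("Q956"))
```

Sending these strings to the public endpoint (query.wikidata.org/sparql) would return the class and its member entities as JSON bindings.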
To create an evaluation benchmark based on existing datasets (e.g., OntoNotes), we iterate over every original entity and replace it with a randomly sampled adversarial entity from our dictionary that shares the same fine-grained class. We argue that the resulting attacks are both natural, i.e., they contain real, valid entities, and adversarial, i.e., the entities are of the same class as the original entities while being out of the distribution of the training data. Specifically, OntoRock has a much larger vocabulary of entity words than OntoNotes, and these words are rarely seen in the training set (Table 4 in Appendix §B). For example, for GPE and PRODUCT, OntoRock has ~3x as many unique entity words as OntoNotes (GPE: 461 vs. 1202; PRODUCT: 54 vs. 158), and the ratio of seen entity words is also much lower (GPE: 75.92% vs. 17.30%; PRODUCT: 44.44% vs. 7.59%).
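The replacement procedure can be sketched as follows, assuming the curated dictionary maps each fine-grained class to its member surface forms (the function name and data layout are hypothetical):

```python
import random

def entity_attack(tokens, entities, adv_dict, seed=0):
    """Replace each gold entity span with a randomly sampled entity of the
    same fine-grained class.

    tokens: list of words; entities: list of (start, end, fine_class) spans
    with end exclusive; adv_dict: fine_class -> list of adversarial surface
    forms. Spans are processed right to left so earlier offsets stay valid
    when a replacement has a different number of tokens."""
    rng = random.Random(seed)
    out = list(tokens)
    for start, end, fine_class in sorted(entities, reverse=True):
        out[start:end] = rng.choice(adv_dict[fine_class]).split()
    return out

sent = "I thank my Beijing friends".split()
print(entity_attack(sent, [(3, 4, "big city")], {"big city": ["Bari"]}))
# → ['I', 'thank', 'my', 'Bari', 'friends']
```

Because the replacement shares the gold span's class and label, the original NER annotations carry over unchanged.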

Context-level Attacks
To investigate the robustness of NER models against changes to the context, we also create natural attacks on the context words. Our intuition is to replace context words with words that are semantically related and syntactically valid, yet out of the distribution of the training data. To this end, we perturb the original context by sampling adversarial tokens from a masked language model such as BERT. Specifically, for each sentence, we choose semantically rich words (nouns, verbs, adjectives, and adverbs) as the target tokens to replace. Then, we generate masked sentences with a random number (at most 3) of [MASK] tokens. These masked sentences are fed into BERT, which decodes the masked positions one by one from left to right. We use the predicted tokens ranked between 100~200, such that the words create a more challenging context while the sentence remains syntactically valid. As there are multiple sampled sentences, we take the one that is the least correlated with the training data. Specifically, we test all candidate sentences on the trained BLSTM-CRF model (which performs the worst among the target NER models) and select the sentences that cause a performance drop.
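The masking-and-decoding step can be sketched as below. `predict_ranked` stands in for a BERT fill-mask head returning candidates ranked by probability; its interface, and the simple `isalpha` content-word filter, are our assumptions rather than the authors' implementation (which uses POS tags).

```python
import random

def context_attack(tokens, is_entity, predict_ranked,
                   max_masks=3, rank_lo=100, rank_hi=200, seed=0):
    """Perturb non-entity content words with mid-rank masked-LM predictions.

    tokens: list of words; is_entity: parallel list of booleans (entity
    tokens are never perturbed); predict_ranked(tokens, i) -> candidate
    words for position i, ranked by LM probability."""
    rng = random.Random(seed)
    maskable = [i for i, (t, e) in enumerate(zip(tokens, is_entity))
                if not e and t.isalpha()]
    if not maskable:
        return list(tokens)
    out = list(tokens)
    n_mask = rng.randint(1, min(max_masks, len(maskable)))
    for i in sorted(rng.sample(maskable, n_mask)):   # decode left to right
        ranked = predict_ranked(out, i)
        pool = ranked[rank_lo:rank_hi] or ranked     # mid-rank = harder context
        out[i] = rng.choice(pool)
    return out
```

In the full method, several such perturbed sentences would be sampled and the one that hurts the trained BLSTM-CRF model the most would be kept.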

OntoRock as a Robustness Benchmark
We create the most challenging version of our RockNER attack by applying both entity-level and context-level attacks on the original development and test sets of OntoNotes, forming our OntoRock benchmark. The overall statistics of OntoRock are shown in Table 3 (Appendix §B), alongside those of the original OntoNotes dataset. We showcase RockNER using OntoNotes in this paper because of the dataset's high annotation quality and comprehensive entity-type coverage; however, this method of attack is also applicable to other datasets.

Evaluating Robustness of NER Models
In this section, we use our OntoRock dataset to evaluate the robustness of popular NER models including spaCy (Honnibal et al., 2020), Stanza (Qi et al., 2020), Flair (Akbik et al., 2018a), BLSTM-CRF (Lample et al., 2016), BERT-CRF (Devlin et al., 2019b), and RoBERTa-CRF (Liu et al., 2019). Model details are described in Appendix §C. We organize our results and analysis around three main research questions and their answers.
Q1: How robust are current NER models?
Main results. We show the F1 scores on the test sets [5] in Table 1. We can see that all NER models have a significant performance drop in the attacked settings (i.e., entity attack only, context attack only, and both); there is a 35%~62% relative decrease in the models' F1 in the fully-attacked setting compared to their results [6] on the original test set. We find that performance on the original test set is positively correlated with robustness against our attacks; thus, models that perform better on in-domain data tend to also be better at handling out-of-distribution examples.
Pre-training & Robustness. BLSTM-CRF is trained solely on the training set of OntoNotes; the NER toolkits spaCy and Stanza are trained on more datasets (e.g., CoNLL03); BERT-CRF, Flair, and RoBERTa-CRF are based on pre-trained language models. We can see that, in terms of robustness, NER models with pre-training tend to outperform models without pre-training but with access to more NER data, which in turn outperform those trained only on the OntoNotes training set. This observation indicates that pre-training (on corpora or other NER data) leads to better robustness, and better pre-trained models (RoBERTa vs. BERT) have a lower relative performance drop. Interestingly, we find that the improved robustness from pre-training mainly comes from improvement on the entity-level attacks, possibly because of increased exposure to entities and increased ability to reason using context (see our first point in Q2).
Q2: Where are the NER models brittle?
[5] Full results on dev and test are reported in Appendix §D.
[6] All models are trained on the OntoNotes training data.

Memorizing or Reasoning? Note that our entity-level attacks test the ability to use the context to infer the entities, as the novel entities themselves are out of distribution; i.e., if a model can reason about the context, it should be robust against entity changes. In turn, the context-level attacks audit the ability to memorize entity patterns, as the changed context is more challenging to infer from. From Table 1, we can see that all models have a smaller performance drop under context-level attacks and a larger drop under entity-level attacks. Therefore, we conclude that NER models are apt to memorize entity patterns presented in the training data and are more brittle when emerging, out-of-distribution entities appear in the inputs. This also suggests that current NER models tend to infer the type and boundary of entities without properly using the context. To make NER models more robust, we believe an important future direction is to develop context-based reasoning approaches that take advantage of inductive biases such as entity triggers (Lin et al., 2020).
Error Analysis. To analyze the additional errors caused by our attacks, we look at each ground-truth entity and inspect how the model's behavior changes at that position. We pair each original entity with its overlapping prediction and categorize it by: (1) whether the predicted type matches (Correct/Wrong); and (2) the number of tokens that differ between the prediction and the truth (d). The results are shown in Figure 3 (left).
We take a closer look by calculating the difference between the model's confusion matrices on the attacked and the original test data (i.e., OntoRock's minus OntoNotes'), as shown in Figure 3 (right). This confusion-difference matrix reveals the model's weakness in handling novel entities, especially when deciding between closely related categories. For example, the biggest difference is the typing error from LOC to GPE (increased by 30 points), indicating that the model struggles to recognize names of countries/cities/states that are not covered by the distribution of the training data.
Apart from that, we find that the entity-level and context-level attacks succeed on different parts of the examples. We denote the sets of entity spans that are mistakenly predicted under entity-only attacks and context-only attacks as S_E and S_C, respectively. Their Jaccard similarity is only 0.04, which shows that the two attacks target different kinds of weaknesses.
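The overlap measure here is plain Jaccard similarity over the two error-span sets. A minimal sketch (the span representation is a hypothetical choice):

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical error spans encoded as (sentence_id, start, end).
S_E = {(0, 3, 4), (1, 0, 2), (2, 5, 6)}   # errors under entity-only attacks
S_C = {(0, 3, 4), (3, 1, 2)}              # errors under context-only attacks
print(round(jaccard(S_E, S_C), 2))  # → 0.25
```

A value near 0, as reported above (0.04), means the two attack types trip the model on almost entirely disjoint spans.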
Q3: Can we improve the robustness of NER models via data augmentation?
Methods. The most straightforward way to improve NER robustness is to augment the examples used for training. Here we use three intuitive data augmentation methods for the analysis. 1) Entity Switching: we replace each entity in the target sentence with a different entity of the same type from another sentence. 2) Random Masking: for each entity, we replace every one of its letters with a random one; we retain the same capitalization pattern and keep all stopwords unchanged. 3) Mixing up: inspired by Guo et al. (2019), we randomly pick one entity from the target sentence and find another sentence that includes an entity of the same type; we then generate an adversarial sentence by merging the first half of the target sentence (up to and including the entity) with the second half of the second sentence (everything after the entity). The three methods are illustrated with examples in Figure 4.
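Random Masking, for instance, can be sketched as below; the stopword list is an illustrative assumption, not the authors' exact lexicon.

```python
import random
import string

STOPWORDS = {"of", "the", "and"}  # illustrative subset

def random_mask_entity(entity: str, rng: random.Random) -> str:
    """Replace every letter of an entity with a random one, preserving the
    capitalization pattern and leaving stopwords and punctuation unchanged."""
    words = []
    for word in entity.split():
        if word.lower() in STOPWORDS:
            words.append(word)          # stopwords stay as-is
            continue
        masked = []
        for ch in word:
            if ch.isupper():
                masked.append(rng.choice(string.ascii_uppercase))
            elif ch.islower():
                masked.append(rng.choice(string.ascii_lowercase))
            else:
                masked.append(ch)       # digits/punctuation unchanged
        words.append("".join(masked))
    return " ".join(words)

print(random_mask_entity("Bank of America", random.Random(0)))
```

The output is a random string with the same shape as the input (e.g., a capitalized 4-letter word, "of", then a capitalized 7-letter word), which exposes the model to entity-shaped strings it cannot memorize.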
Results & Analysis. The results of the three methods on the RoBERTa-CRF model are shown in Table 1. Surprisingly, the most straightforward method, Random Masking, offers the best improvement against entity-level attacks. We conjecture this is because it provides more entity patterns, which enhances entity-level generalization and makes the model focus more on the context for inference, resulting in better performance on entity-level attacks (63.4% → 66.3%). As Entity Switching repeats original entities in different contexts of the training set, it aims to improve the use of context to infer entities; the entity-level attacks are indeed better handled (63.4% → 64.7%). The Mixing up method, however, loses robustness in all settings, possibly due to noise from sentences that are not syntactically valid.

Related Work
There are other recent works that also shift attention from achieving a new state of the art in NER toward studying NER models' robustness and generalization ability. Agarwal et al. (2020a) create entity-switched datasets by replacing entities with others of the same type but different national origin, and find that NER models perform worse on entities from certain countries. Mayhew et al. (2020) and Bodapati et al. (2019) focus on robustness when inputs are not written in the standard casing (e.g., "he is from us" → "US"). Fu et al. (2020) analyze the generalization ability of current NER models by evaluating them across datasets. Agarwal et al. (2020b) further analyze the roles of context and names in entity predictions made by models and humans. Although these works begin to probe the robustness issue of NER models, they do not build an automated pipeline to generate natural adversarial instances with large coverage (e.g., thousands of fine-grained classes) at scale.
There are also works in other domains aiming to evaluate models' robustness with perturbed inputs. For example, Jia and Liang (2017) attack reading comprehension models by adding distracting word sequences to the input, and Gan and Ng (2019) attack question answering models by paraphrasing the questions.

Conclusion
Our contributions in this short paper are two-fold. 1) Resource-wise: we develop RockNER, a straightforward method for generating natural adversarial attacks for NER, which produces OntoRock, a benchmark for auditing the robustness of NER models. 2) Evaluation-wise: our experiments and analysis provide empirically supported answers to three main research questions on the robustness of current mainstream NER models. We believe RockNER and its produced attacks (e.g., the OntoRock benchmark) can benefit the community working to improve the robustness and out-of-distribution generalization of NER.

A Statistics of the Entity Dictionary

In Table 2, we show the statistics of the collected entity dictionary.

B Dataset Statistics

We adopt the train/dev/test splits of OntoNotes used by Pradhan et al. (2013) in our experiments. Table 3 presents the statistics of our OntoRock benchmark, which consists of the original OntoNotes training set and our attacked (full version) development and test sets. Table 4 shows the statistics of entities in the training set, OntoNotes' test set, and OntoRock's test set.

C Model Details
For spaCy, we load the "en_core_web_lg" model with the whitespace tokenizer. For the Stanza model, we use the English model and set processors to "tokenize,ner" with tokenize_pretokenized=True.

When we train the Flair model with a GPU, we set mini_batch_size to 64, train_with_dev to False, and embeddings_storage_mode to "none".

For training the BLSTM-CRF, BERT-CRF, and RoBERTa-CRF models, we set batch_size to 20. We use early stopping with patience=10 for BLSTM-CRF and patience=5 for the other two.

D Full Results
Precision/Recall/F1 scores for each model on the original OntoNotes and our OntoRock benchmark are presented in Table 6 (test set) and Table 7 (development set).

E Cases
In Fig. 5, we show examples of entity-level attacks on the RoBERTa-CRF model. These examples should be easy to solve based on the context. For example, "a host of" in sentence 1 and "holiday" in sentence 4 are explicit clues. If NER models were capable of inferring from context, these clues could have helped them achieve better performance. This qualitatively validates our hypothesis that NER models tend to memorize entity patterns instead of inferring entity labels from context.

F More Details of Error Analysis
In Figure 7, we present the confusion matrices of the RoBERTa-CRF model on the OntoNotes and OntoRock test sets. We use them to calculate the confusion-difference matrix (Figure 3 (right)).

Following Figure 3 (left), we categorize the error cases of predictions by the RoBERTa-CRF model and its variants trained with augmented data. The results are presented in Table 5. Among the three augmentation methods, Random Masking achieves the highest F1 score on both the attacked and original test sets. The robustness gain mainly comes from more accurate typing (Wrong Type, d = 0: 14.27% → 12.88%).

G Attacking Curve
For the entity-level attacks, we conduct five separate attacks by replacing 20%, 40%, 60%, 80%, and 100% of the entities in the test set. For each model, we evaluate it on the five generated test sets and plot a curve of the F1 scores, shown in Fig. 6. The descending trend is intuitive, and the performance of weak models drops more rapidly than that of strong models.