TMR: Evaluating NER Recall on Tough Mentions

We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of "tough" mentions: unseen mentions, those whose tokens or token/type combination were not observed in training, and type-confusable mentions, token sequences that appear with multiple entity types in the test data. We demonstrate the usefulness of these metrics by evaluating five recent neural architectures on English, Spanish, and Dutch corpora. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models in Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed when looking only at overall precision, recall, and F1.


Introduction
For decades, the standard measures of performance for named entity recognition (NER) systems have been precision, recall, and F1 computed over entity mentions.[1] NER systems are primarily evaluated using exact-match[2] F1 score, micro-averaged across mentions of all entity types. While per-entity-type scores available from the conlleval scorer (Tjong Kim Sang, 2002) are often reported, there are no widely used diagnostic metrics that further analyze the performance of NER systems and allow for separation of systems close in F1.

This work proposes Tough Mentions Recall (TMR), a set of metrics that provide a fine-grained analysis of the mentions that are likely to be most challenging for a system: unseen mentions, which are present in the test data but not the training data, and type-confusable mentions, which appear with multiple types in the test set. We evaluate the performance of five recent popular neural systems on English, Spanish, and Dutch data using these fine-grained metrics. We demonstrate that the TMR metrics enable differentiation between otherwise similar-scoring systems, and that the model that performs best overall might not be the best on the tough mentions. Our NER evaluation tool is publicly available via a GitHub repository.[3]

[1] We use the term mention to refer to a specific annotated reference to a named entity: a span of tokens (token sequence) and an entity type. We reserve the term entity for the referent, e.g., the person being named. The traditional NER F1 measure is computed over mentions ("phrase" F1).
[2] While partial-match metrics have been used (e.g., Chinchor and Sundheim, 1993; Chinchor, 1998; Doddington et al., 2004; Segura-Bedmar et al., 2013), exact matching is still most commonly used and the only approach we explore.
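To make the standard evaluation concrete, exact-match mention-level precision, recall, and F1 can be sketched in a few lines. This is a minimal illustration, not the conlleval implementation; representing a mention as a (start, end, type) triple and the function name `mention_prf1` are our assumptions.

```python
from collections import Counter

def mention_prf1(gold, predicted):
    """Exact-match mention-level P/R/F1, micro-averaged over all entity types.

    Mentions are (start, end, type) triples; a prediction counts as correct
    only if both its span and its type exactly match a gold mention.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: each gold mention can be matched at most once
    true_pos = sum(min(gold_counts[m], pred_counts[m]) for m in pred_counts)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]  # wrong type on the second mention
p, r, f1 = mention_prf1(gold, pred)
# p = 0.5, r = 1/3, f1 = 0.4
```

Note that the mention with the correct span but wrong type counts as both a false positive and a false negative under exact matching.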

Related Work
Previous work in NER and sequence labeling has examined performance on out-of-vocabulary (OOV) tokens and rare or unseen entities. Ma and Hovy (2016) and others evaluate system performance on mentions containing tokens not present in the pretrained embeddings or training data. Such analysis can be applied broadly (Ma and Hovy perform similar analyses for part-of-speech tagging and NER) and can guide system design around the handling of those tokens. Augenstein et al. (2017) present a thorough analysis of the generalization abilities of NER systems, quantifying the performance gap between seen and unseen mentions, among many other factors. Their work predates current neural NER models; the newest model they use in their evaluation is SENNA (Collobert et al., 2011). While prior work has considered evaluation on unseen mentions, it has focused on English data, and its definition of "unseen" has focused on the tokens themselves being unseen (UNSEEN-TOKENS in our work). We use the umbrella of "tough mentions" to cover a number of possible distinctions that can be made with regard to how unseen test set data is, and we experiment on multiple languages.

Table 1: Example data and how mentions would be classified into unseen and type-confusable mention sets.

TRAINING SET: Newcastle [LOC] is a city in the UK [LOC] .
TEST SET:     John Brown [PER] , the Newcastle [ORG] star from the UK [LOC] , has. . .

Test mention        Classification
UK [LOC]            SEEN
Newcastle [ORG]     UNSEEN-TYPE (also UNSEEN-ANY)
John Brown [PER]    UNSEEN-TOKENS (also UNSEEN-ANY)

Mesbah et al. (2018) propose an iterative approach for long-tail entity extraction, focusing on entities of two specific types in the scientific domain. Other work proposes evaluation on a set of unique mentions, which emphasizes the ability of a system to recognize rarer entities. As entities and their types change quickly (Derczynski et al., 2015), recall on emerging entities is becoming a more critical measure in evaluating progress. Ribeiro et al. (2020) propose CHECKLIST, which can be applied to NER by using invariance tests; for example, replacing a mention with another of the same entity type should not affect the output of the model. Fu et al. (2020) evaluate the generalization of NER models through breakdown tests, annotation errors, and dataset bias. They examine performance on subsets of entities based on the entity coverage rate between the training and test sets. They also release ReCoNLL, a revised version of CoNLL-2003 English with fewer annotation errors, which we use in this work.

Unseen Mentions
Given annotated NER data divided into a fixed train/development/test split, we are interested in the relationship between the mentions of the training and test sets. We classify mentions into three mutually exclusive sets described in Table 1 (SEEN, UNSEEN-TYPE, and UNSEEN-TOKENS) and a superset UNSEEN-ANY that is the union of UNSEEN-TYPE and UNSEEN-TOKENS. UK [LOC] appears in both the training and test sets, so it is a SEEN mention. As there is no mention consisting of the token sequence John Brown annotated with any type in the training data, John Brown [PER] is an UNSEEN-TOKENS mention.[4] While there is no mention with the tokens and type Newcastle [ORG] in the training data, the token sequence Newcastle does appear as a mention, albeit with a different type (LOC). Newcastle [ORG] is therefore an UNSEEN-TYPE mention: the same token sequence has appeared as a mention, but not with the type ORG.
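This three-way classification can be sketched directly from the definitions above. The `Mention` tuple and function name are ours, and the sketch omits document positions (real mentions are anchored to specific spans in specific documents); matching is case sensitive and exact, as in our evaluation.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    tokens: tuple  # the token sequence, e.g. ("John", "Brown")
    etype: str     # the entity type, e.g. "PER"

def classify_mention(mention, train_mentions):
    """Classify a test mention as SEEN, UNSEEN-TYPE, or UNSEEN-TOKENS."""
    train_pairs = {(m.tokens, m.etype) for m in train_mentions}
    train_tokens = {m.tokens for m in train_mentions}
    if (mention.tokens, mention.etype) in train_pairs:
        return "SEEN"
    if mention.tokens in train_tokens:
        return "UNSEEN-TYPE"    # tokens seen in training, but never with this type
    return "UNSEEN-TOKENS"      # token sequence never seen as a mention

# The example from Table 1
train = [Mention(("Newcastle",), "LOC"), Mention(("UK",), "LOC")]
classify_mention(Mention(("UK",), "LOC"), train)            # "SEEN"
classify_mention(Mention(("Newcastle",), "ORG"), train)     # "UNSEEN-TYPE"
classify_mention(Mention(("John", "Brown"), "PER"), train)  # "UNSEEN-TOKENS"
```

UNSEEN-ANY is then simply any mention not classified as SEEN.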

Type-confusable Mentions
Token sequences that appear as mentions with multiple types in the test set form another natural set of challenging mentions. If Boston [LOC] , the city, and Boston [ORG] , referring to a sports team,[5] are both in the test set, we consider all mentions of exactly the token sequence Boston to be type-confusable mentions (TCMs), members of TCM-ALL. We can further divide this set based on whether each mention is unseen: TCM-UNSEEN is the intersection of TCM-ALL and UNSEEN-TOKENS; TCM-SEEN is the rest of TCM-ALL.
Unlike Fu et al. (2020), who explore token sequences that occur with different types in the training data, we base our criteria for TCMs on type variation in the test data. Doing so places the focus on whether the model can correctly produce multiple types in the output, as opposed to how it reacts to multiple types in the input. Also, if type confusability were based on the training data, it would be impossible to have TCM-UNSEEN mentions: a token sequence that is type-confusable in the training data has been seen at least twice in training and thus cannot be considered unseen. As our metrics compute subsets over the gold-standard mentions, it is natural to measure only recall and not precision on those subsets, as it is not clear exactly which false positives should be considered in computing precision.
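The TCM subsets and recall restricted to a subset can be sketched as follows. This is a simplified illustration under our own naming (`tcm_sets`, `subset_recall`); mentions are (token_sequence, type) pairs without document positions, whereas a full implementation would match gold and predicted mentions by span.

```python
from collections import defaultdict

def tcm_sets(test_mentions, train_mentions):
    """Split test mentions into TCM-ALL, TCM-SEEN, and TCM-UNSEEN.

    A mention is a (token_sequence, entity_type) pair; the token sequence
    is a tuple of tokens, matched case-sensitively.
    """
    types_by_tokens = defaultdict(set)
    for tokens, etype in test_mentions:
        types_by_tokens[tokens].add(etype)
    # TCM-ALL: token sequences appearing with more than one type in the test set
    tcm_all = [m for m in test_mentions if len(types_by_tokens[m[0]]) > 1]
    # TCM-UNSEEN is the intersection of TCM-ALL and UNSEEN-TOKENS
    train_tokens = {tokens for tokens, _ in train_mentions}
    tcm_unseen = [m for m in tcm_all if m[0] not in train_tokens]
    tcm_seen = [m for m in tcm_all if m[0] in train_tokens]
    return tcm_all, tcm_seen, tcm_unseen

def subset_recall(gold_subset, predicted_mentions):
    """Recall restricted to a gold subset: exact token-and-type matches only."""
    predicted = set(predicted_mentions)
    if not gold_subset:
        return float("nan")
    return sum(m in predicted for m in gold_subset) / len(gold_subset)

train = [(("Boston",), "LOC")]
test = [(("Boston",), "LOC"), (("Boston",), "ORG"), (("Paris",), "LOC")]
tcm_all, tcm_seen, tcm_unseen = tcm_sets(test, train)
# Both Boston mentions are TCM-SEEN: the token sequence appears in training
```

Because only recall is computed, predictions outside the gold subset are simply ignored rather than penalized, matching the discussion of false positives above.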

Data Composition
We evaluate using the ReCoNLL English (Fu et al., 2020), OntoNotes 5.0 English (Weischedel et al., 2013, using the data splits from Pradhan et al., 2013), CoNLL-2002 Dutch, and CoNLL-2002 Spanish (Tjong Kim Sang, 2002) datasets. We use ReCoNLL (Fu et al., 2020) in our analysis instead of the CoNLL-2003 English data (Tjong Kim Sang and De Meulder, 2003) to improve accuracy, as it contains a number of corrections. Tables 2, 3, and 4 give the total mentions of each entity type and the percentage that fall under the proposed unseen and TCM subsets for the three CoNLL datasets.[6] Across the three languages, 39.6%-54.6% of mentions are unseen, with the highest rate coming from PER mentions. UNSEEN-TYPE contains under 2% of mentions in English and Spanish and almost no mentions in Dutch; it is rare for a token sequence to appear in training only with types that do not appear with it in the test data.

[4] The matching criterion for the token sequence is case sensitive, requires an exact (not partial) match, and only considers mentions.
Similarly, TCMs appear in the English (10.7%) and Spanish (6.3%) data, but almost never in Dutch (0.2%). The differences across languages with regard to TCMs may reflect morphology or other patterns that prevent the same token sequence from appearing with multiple types, but they could also be caused by the topics included in the data. In English, the primary source of TCMs is the use of city names as sports organizations, creating LOC-ORG confusion.

[6] Tables for OntoNotes 5.0 English are provided in the appendix (Tables 16-17).

Models and Evaluation
We tested five recent mainstream neural NER architectures that either previously achieved state-of-the-art performance or are widely used in the research community.[7] The models are CHARCNN+WORDLSTM+CRF[8] (CHARCNN), CHARLSTM+WORDLSTM+CRF[8] (CHARLSTM), CASED BERT-BASE[9] (Devlin et al., 2019), BERT-CRF[10] (Souza et al., 2019), and FLAIR[11] (Akbik et al., 2018). We trained all models on the training set of each dataset. We fine-tuned English Cased BERT-Base, Dutch (Vries et al., 2019), and Spanish (Cañete et al., 2020) BERT models and used the model from epoch 4 after comparing development set performance for epochs 3, 4, and 5. We also fine-tuned BERT-CRF models on the training data and used the model from the epoch with the best development set performance within a maximum of 16 epochs.
All models were trained five times each on a single NVIDIA TITAN RTX GPU. The mean and standard deviation of scores over the five training runs are reported for each model. It took approximately 2 hours to train each of FLAIR and NCRF++ on each of the CoNLL-2002/2003 datasets, 12 hours to train FLAIR on OntoNotes 5.0 English, and 4 hours to train NCRF++ on OntoNotes 5.0 English. It took less than an hour to fine-tune the BERT or BERT-CRF models on each dataset. Hyperparameters for the Spanish and Dutch models implemented using NCRF++ were taken from Lample et al. (2016). FLAIR does not provide hyperparameters for training on CoNLL-02 Spanish, so we used those for CoNLL-02 Dutch. We did not perform any other hyperparameter tuning.

[7] We could not include a recent system by Baevski et al.

Baseline Results
We first examine the performance of these systems under standard evaluation measures. Tables 5 and 6 give performance on the ReCoNLL and OntoNotes 5.0 English datasets using standard P/R/F1. In English, Flair attains the best F1 on both datasets, although BERT attains higher recall on OntoNotes.[12] BERT attains the highest F1 in Dutch (91.26) and Spanish (87.36); due to space limitations, tables are provided in the appendix (Tables 14-15). BERT-CRF performs similarly to or slightly worse than BERT in all languages, but generally attains lower standard deviation across training runs, which suggests greater stability from using a CRF for structured prediction. The same observation also holds for Flair, which also uses a CRF layer. We are not aware of prior work showing results from using BERT-CRF on English, Spanish, and Dutch. Souza et al. (2019) show that the combination of Portuguese BERT-Base and a CRF does not perform better than bare BERT-Base, which agrees with our observations. F1 rankings are otherwise similar across languages. The performance of CharLSTM and CharCNN cannot be differentiated in English, but CharLSTM substantially outperforms CharCNN in Spanish (+2.53) and Dutch (+2.15).

[12] We are not aware of any open-source implementation capable of matching the F1 of 92.4 reported by Devlin et al. (2019). The gap between published and reproduced performance likely stems from the use of the "maximal document context," while reimplementations process sentences independently, as is typical in NER. The performance of Flair is slightly worse than that reported in the original paper because we did not use the development set as additional training data.

TMR for English
We explore English first and in greatest depth because its test sets are much larger than those of the other languages we evaluate, and we have multiple well-studied test sets for it. Additionally, the CoNLL-2003 English test data is from a later time than the training set, reducing train/test similarity.
Revised CoNLL English. One of the advantages of evaluating using the TMR metrics is that systems can be differentiated more easily. Table 7 gives recall for type-confusable mentions (TCMs) on ReCoNLL English. As expected, recall for TCMs is lower than overall recall, but more importantly, recall is less tightly grouped over the TCM subsets (range of 8.17) than over all mentions (1.76). This spread allows for better differentiation, even though there is a higher standard deviation for each score. For example, BERT-CRF generally performs very similarly to BERT but scores 5.87 points lower on TCM-UNSEEN, possibly due to how the CRF handles lower-confidence predictions differently (Lignos and Kamyab, 2020). Flair has the highest all-mentions recall and the highest recall for TCMs, suggesting that when type-confusable mentions have been seen in the training data, it is able to effectively disambiguate types based on context.

Table 8 gives recall for unseen mentions. Although Flair attains higher overall recall, BERT attains higher recall on UNSEEN-TYPE, the set on which all models perform their worst. While there are few (85) mentions in this set, making assessment of statistical reliability challenging, this set allows us to identify an advantage for BERT in this specific subset: a BERT-based NER model is better able to produce a novel type for a token sequence seen only with other types in the training data.
OntoNotes 5.0 English. Examination of the OntoNotes English data shows that Flair outperforms BERT for type-confusable mentions, but BERT maintains its lead in overall recall when examining unseen mentions. Tables 9 and 10 give recall for type-confusable and unseen mentions.[13]

Summary. Table 11 gives a high-level comparison between BERT and Flair on English data. Using the TMR metrics, we find that the models that attain the highest overall recall may not perform the best on tough mentions. However, the results vary based on the entity ontology in use. In a head-to-head comparison between Flair and BERT on ReCoNLL English, despite Flair having the highest overall and TCM recall, BERT performs better than Flair on UNSEEN-TYPE, suggesting that BERT is better at predicting the type for a mention seen only with other types in the training data. In contrast, on OntoNotes 5.0 English, BERT attains the highest recall on UNSEEN mentions but performs worse than Flair on TCMs. The larger and more precise OntoNotes ontology results in the unseen and type-confusable mentions being different than in the smaller CoNLL ontology. In general, Flair performs consistently better on TCMs while BERT performs better on UNSEEN mentions.

[13] We do not display results for TCM-UNSEEN and UNSEEN-TYPE as they each represent less than 1% of the test mentions. BERT's recall for TCM-UNSEEN mentions is 19.51 points higher than any other system's. However, as there are 41 mentions in that set, the difference is only 8 mentions.

TMR for CoNLL-02 Spanish/Dutch
Tables 12 and 13 give recall for type-confusable and unseen mentions for CoNLL-2002 Spanish and Dutch.[14] The range of the overall recall for Spanish (11.80) and Dutch (17.13) among the five systems we evaluate is much larger than in English (1.76), likely due to systems being less optimized for those languages. In both Spanish and Dutch, BERT has the highest recall overall and in every subset.
While our proposed TMR metrics do not help differentiate models in Spanish and Dutch, they can provide estimates of performance on subsets of tough mentions across languages and identify areas for improvement. For example, while the percentage of UNSEEN-TYPE mentions in Spanish (1.8%) and ReCoNLL English (1.5%) is similar, BERT's performance on those mentions in Spanish is 34.04 points below that for ReCoNLL English. By using the TMR metrics, we have identified a gap that is not visible by just examining overall recall.
Compared with ReCoNLL English (10.7%) and Spanish (6.3%), there are far fewer type-confusable mentions in Dutch (0.2%). Given the sports-centric nature of the English and Spanish datasets, which creates many LOC/ORG-confusable mentions, it is likely that their TCM rate is artificially high. However, the near-zero rate in Dutch is a reminder that either linguistic or data collection properties may result in a high or negligible number of TCMs. OntoNotes English shows a similar rate (7.7%) to ReCoNLL English, but due to its richer ontology and larger set of types, these numbers are not directly comparable.

Conclusion
We have proposed Tough Mentions Recall (TMR), a set of evaluation metrics that provide a fine-grained analysis of formalized sets of mentions that are most challenging for an NER system. By looking at recall on specific kinds of "tough" mentions, unseen and type-confusable ones, we are able to better differentiate between otherwise similar-performing systems, compare systems along dimensions beyond the overall score, and evaluate how systems perform on the most difficult subparts of the NER task.
We summarize our findings as follows. For English, the TMR metrics provide greater differentiation across systems than overall recall and are able to identify differences in performance between BERT and Flair, the best-performing systems in our evaluation. Flair performs better on type-confusable mentions regardless of ontology, while performance on unseen mentions largely follows overall recall, which is higher for Flair on ReCoNLL and for BERT on OntoNotes.
In Spanish and Dutch, the TMR metrics are not needed to differentiate systems overall, but they provide some insight into performance gaps between Spanish and English related to UNSEEN-TYPE mentions.
One challenge in applying these metrics is simply that there may be relatively few unseen mentions or TCMs, especially in the case of lower-resourced languages. While we are interested in finer-grained metrics for lower-resourced settings, data sparsity issues pose great challenges. As shown in Section 3.3, even in a higher-resourced setting, some subsets of tough mentions include fewer than 1% of the total mentions in the test set. We believe that lower-resourced NER settings can still benefit from our work by gaining information on pretraining or tuning models towards better performance on unseen and type-confusable mentions.
For new corpora, these metrics can be used to guide construction and corpus splitting to make test sets as difficult as possible, making them better benchmarks for progress. We hope that this form of scoring will see wide adoption and help provide a more nuanced view of NER performance.