Cross-Lingual Cross-Domain Nested Named Entity Evaluation on English Web Texts

Named Entity Recognition (NER) is a key Natural Language Processing task. However, most existing work on NER targets flat named entities (NEs) and ignores the recognition of nested structures, where entities can be enclosed within other NEs. Moreover, evaluation of Nested Named Entity Recognition (NNER) across domains remains challenging, mainly due to the limited availability of datasets. To address these gaps, we present EWT-NNER, a dataset covering five web domains annotated for nested named entities on top of the English Web Treebank (EWT). We present the corpus and an empirical evaluation, including transfer results from German and Danish. EWT-NNER is annotated for four major entity types, including suffixes for derivational entity markers and partial named entities, spanning a total of 12 classes. We envision the public release of EWT-NNER to encourage further research on nested NER, particularly on cross-lingual cross-domain evaluation.


Introduction
Named Entity Recognition (NER) is the task of finding and classifying named entities in text, such as locations, organizations, and person names. It is a key task in Natural Language Processing (NLP), and an important step for downstream applications like relation extraction, co-reference resolution and question answering. The task has received a substantial amount of attention. However, tools and existing benchmarks largely focus on flat, coarse-grained entities and single-domain evaluation.
Flat, coarse-grained annotation, however, eschews semantic distinctions which can be important in downstream applications (Ringland et al., 2019). Examples include embedded locations ('New York Times'), entities formed via derivation ('Italian cuisine') and tokens which are in part named entities ('the Chicago-based company'). Research interest in methods to handle nested entities is increasing (Katiyar and Cardie, 2018). However, there is a lack of datasets, particularly resources which cover multiple target domains.
To facilitate research on cross-domain nested NER, we introduce a new layer on top of the English Web Treebank (EWT), manually annotated for NNER. The corpus spans five web domains and four major named entity types, enriched with suffixes marking derivations and partial NEs.

Contributions The main contributions are: i) we introduce EWT-NNER, a corpus for nested NER over five web domains; ii) we report cross-lingual and in-language baselines. Our results highlight the challenges of processing web texts, and the need for research on cross-lingual cross-domain NNER.

Related Work
Nested NER Much research has been devoted to flat Named Entity Recognition, with a long tradition of shared tasks (Grishman and Sundheim, 1996; Grishman, 1998; Tjong Kim Sang and De Meulder, 2003; Baldwin et al., 2015). The problem of nested named entity recognition (NNER) has instead received less attention. This lack of breadth of research has been attributed to practical reasons (Finkel and Manning, 2009), including a lack of annotated corpora (Ringland et al., 2019).
Existing nested NE corpora span only a handful of languages and text domains. This is in stark contrast to resources for flat NER, which are available for up to 282 languages (Pan et al., 2017) and multiple domains, including a very recent effort (Liu et al., 2021). Existing NNER resources for English cover newswire (e.g., ACE, WSJ) (Mitchell et al., 2005; Ringland et al., 2019) and biomedical data (e.g., GENIA) (Alex et al., 2007; Pyysalo et al., 2007). Beyond English, there exist free and publicly available nested NER datasets. These include the GermEval 2014 dataset (Benikova et al., 2014a), one of the largest existing German NER resources, covering largely news articles (Benikova et al., 2014b). Recently, the GermEval annotation guidelines inspired the creation of a Danish corpus (Plank et al., 2020), which adds a layer of nested NER on top of the existing Danish Universal Dependencies treebank (Johannsen et al., 2015). Both the German and Danish corpora derive their annotation guidelines from the NoSta-D annotation scheme (Benikova et al., 2014b), which we adopt for EWT-NNER (Section 3.1). To facilitate research, a fine-grained nested NER annotation on top of the Penn Treebank WSJ has recently been released (Ringland et al., 2019). In contrast to ours, the WSJ NNER corpus spans 114 entity types and 6 layers, and includes numerals and time expressions beyond named entities. We instead focus on NEs, with a total of 12 classes and 2 layers.
As outlined by Katiyar and Cardie (2018), nested named entities are attracting more research attention. Modeling solutions opt for diverse strategies, from hierarchical systems to graph-based methods and models based on linearization (Alex et al., 2007; Finkel and Manning, 2009; Sohrab and Miwa, 2018; Luan et al., 2019; Lin et al., 2019; Zheng et al., 2019; Straková et al., 2019; Shibuya and Hovy, 2020). The current top-performing neural systems typically use either a linearization, a multi-task learning, or a graph-based approach (Straková et al., 2019; Plank et al., 2020; Yu et al., 2020). We evaluate two such methods.
English Web Treebank The English Web Treebank (EN-EWT) (Bies et al., 2012; Petrov and McDonald, 2012; Silveira et al., 2014) is a dataset introduced as part of the first workshop on Syntactic Analysis of Non-Canonical Language (SANCL). The advantage of the EWT is that it spans over 200k tokens of text from five web domains: Yahoo! answers, newsgroups, weblogs, local business reviews from Google, and Enron emails. Gold annotations are available for several NLP tasks. The corpus was originally annotated for part-of-speech tags and constituency structure in Penn Treebank style (Bies et al., 2012). Gold-standard dependency structures were annotated on the EWT via the Universal Dependencies project (Silveira et al., 2014). Recent efforts extend the EWT (or parts thereof) with further semantic (Abend et al., 2020) and temporal layers (Vashishtha et al., 2019). We contribute a novel nested NER layer on top of the freely available UD EN-EWT corpus split (Silveira et al., 2014).

The EWT-NNER corpus
This section describes the corpus and annotation.

Annotation Scheme and Process
We start from the NoSta-D named entity annotation scheme (Benikova et al., 2014b), introduced in the GermEval 2014 shared task and adopted for Danish (Plank et al., 2020). The entity labels span a total of 12 classes, distributed over four major entity types (Tjong Kim Sang and De Meulder, 2003): location (LOC), organization (ORG), person (PER) and miscellaneous (MISC), plus two sub-types: '-part' and '-deriv'. Entities are annotated using a two-level scheme: first-level annotations contain the largest entity spans (e.g., the 'Alaskan Knight'), while second-level annotations contain nested entities. In particular:
• We annotate named entities with two layers. The outermost layer covers the longest span and the most prominent entity reading; the inner span contains secondary or sub-entity readings. If there were more than two layers, we drop further potential readings in favor of keeping two layers.
• Only full nominal phrases are potential full NEs. Pronouns and all other phrases are ignored. National holidays and religious events (Christmas, Ramadan) are also not annotated. Determiners and titles are not part of NEs.
• Named entities can also be part of tokens and are annotated as such with the suffix '-part', e.g., 'the [Chicago]LOCpart-based company'.
• Derivations of NEs are marked via the suffix '-deriv', e.g., 'the [Alaskan]LOCderiv movie'.
• Geopolitical entities deserve special attention. We opted for annotating their first reading as ORG, with a secondary LOC reading, to reduce ambiguity. This follows the Danish guidelines. The original German NoSta-D guidelines did not cover this case in detail, yet mention that some categories were conflated (LOC and geopolitical entities). Geopolitical entities seem most frequently annotated as LOC there, yet we also find annotations similar to ours in the German data (especially for multi-token NEs like 'Borussia Dortmund').
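To make the two-level scheme concrete, here is a small illustrative sketch (our own construction, built from the 'New York Times' example in the introduction; the released files may use different column conventions): each token carries one BIO tag per layer, and every inner entity must fall inside an outer entity span.

```python
# Hypothetical two-layer BIO encoding for "the New York Times".
# The outer layer holds the most prominent reading (ORG),
# the inner layer the embedded location "New York".
tokens = ["the", "New", "York", "Times"]
outer = ["O", "B-ORG", "I-ORG", "I-ORG"]  # determiner excluded from the NE
inner = ["O", "B-LOC", "I-LOC", "O"]      # nested LOC reading

assert len(tokens) == len(outer) == len(inner)
# Sanity check: an inner entity token must lie within some outer entity.
for i, tag in enumerate(inner):
    if tag != "O":
        assert outer[i] != "O"
print("consistent two-layer annotation")
```

The same pair of tag sequences also encodes the deriv/part sub-types, since they are ordinary labels in the 12-class inventory (e.g., B-LOCderiv).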
The full annotation guidelines for EWT-NNER, with examples, annotation decisions and difficult cases, can be found in the accompanying repository. Two annotators were involved in the process, both of whom contributed to the earlier Danish corpus. One annotator is an expert annotator with a degree in linguistics; the second annotator is a computer scientist. A data statement is provided in the appendix. Inter-annotator agreement (IAA) is measured on a random sample of 100 sentences drawn from the development data. This resulted in the following agreement statistics: raw token-level agreement of 98%, Cohen's kappa of 88.20% over all tokens, and Cohen's kappa of 81.73% for tokens taking part in an entity as marked by at least one annotator. The final dataset was annotated by the professional linguist. Annotation took around 3 working days per 25,000 tokens.
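For readers unfamiliar with the agreement statistic, Cohen's kappa corrects raw agreement for chance agreement between the two label distributions. A minimal pure-Python sketch (illustrative only; not the script used for the reported IAA numbers, and the token sequences below are invented):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' token-level labels."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of tokens labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling with each
    # annotator's own label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
ann2 = ["O", "B-PER", "I-PER", "O", "O", "O"]
kappa = cohens_kappa(ann1, ann2)  # agreement above chance, below 1.0
```

Note how kappa is well below the 5/6 raw agreement here, mirroring the gap between the 98% token agreement and the 88.20% kappa reported above.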

Data statistics
Statistics over the data splits and the distribution of the web texts are provided in Table 1. A comparison to German and Danish on coarse-level statistics is provided in Table 3. Details of the entity distribution in EWT-NNER are given in Table 2. The entire EWT-NNER corpus contains a total of over 16,000 sentences and over 13,000 entities. Around 42% of the sentences contain NEs. Over 11.6% of the entities are nested NEs, 8.3% are derivations, and 1.1% are parts of names. Compared to GermEval 2014, this is a higher density of nested entities (11.6% vs. 7.7% in the German data), yet a lower percentage of derivations and partial NEs. The data is provided in CoNLL tabular format with BIO entity encoding.
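A BIO-encoded layer can be decoded back into labeled spans with a few lines of code. The sketch below is our own illustration of such a reader (the tooling shipped with the corpus may differ):

```python
def bio_to_spans(tags):
    """Decode one BIO layer into (start, end_exclusive, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # current entity continues
        if etype is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

# One BIO layer for "the New York Times" (outer entity reading).
print(bio_to_spans(["O", "B-ORG", "I-ORG", "I-ORG"]))  # [(1, 4, 'ORG')]
```

Running the decoder once per layer yields the nested span structure used for evaluation.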

Experimental Setup
We are interested in a set of benchmark results to provide: a) zero-shot transfer results from Danish and German; b) in-language results (training on all 5 EN domains vs per-domain models); and c) results on cross-lingual cross-domain evaluation when training on multiple languages jointly.
For the experiments, we fine-tune contextualized embedding models. As contextualized embeddings, we investigate BERT (Devlin et al., 2019), multilingual BERT (mBERT) and XLM-R (Conneau et al., 2020). We evaluated two decoding strategies: the first takes the Cartesian product of the inner and outer NER layers and treats it as a standard single-label decoding problem. An advantage of this strategy is that any sequence tagging framework can be used; a disadvantage is the increased label space. To tackle this, the second strategy uses a two-headed multi-task decoder, one head per entity layer, which has been found effective (Plank et al., 2020). Initial experiments confirmed that single-label decoding is less accurate, in line with earlier findings (Straková et al., 2019; Plank et al., 2020). We therefore report results with the two-headed decoder only, and provide further results in the appendix.
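The label-space blow-up of the Cartesian-product strategy is easy to quantify. A back-of-the-envelope sketch (the 12 classes are from the paper; the exact BIO tag inventory and the joint-label format are our assumptions, and in practice one would only keep combinations observed in the training data):

```python
from itertools import product

# 12 entity classes (4 major types plus -deriv and -part sub-types).
classes = ["LOC", "ORG", "PER", "MISC",
           "LOCderiv", "ORGderiv", "PERderiv", "MISCderiv",
           "LOCpart", "ORGpart", "PERpart", "MISCpart"]
# Per-layer BIO tag set: "O" plus B-/I- for each class -> 25 tags.
layer_tags = ["O"] + [f"{p}-{c}" for c in classes for p in ("B", "I")]

# Single-label decoding: one joint tag per token over both layers.
joint = [f"{o}|{i}" for o, i in product(layer_tags, layer_tags)]
print(len(layer_tags), len(joint))  # 25 625

# A two-headed decoder instead predicts over 25 tags per head,
# i.e. 2 * 25 = 50 outputs rather than 625 joint labels.
```

This quadratic growth is the "increased label space" disadvantage mentioned above.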
Evaluation is based on the official GermEval 2014 (Benikova et al., 2014b) metric and script, i.e., strict span-based F1 over both entity levels. Table 4 shows the results of training models on German (DE: GermEval 2014), Danish (DA: DaN+), and their union (+), for zero-shot transfer (top rows). It provides further results of training on all English EWT-NNER training data (from all five web domains), both for multilingual models (multilingual BERT or XLM-R) and monolingual models (English BERT and RoBERTa).

Take-aways While zero-shot transfer between news domains (on German and Danish) is around 70 F1 (68.5 and 76.7), zero-shot transfer to the EWT-NNER web domains is low, particularly for answers, reviews, emails and newsgroups. Training on both Danish and German improves zero-shot performance over all domains. For English cross-domain evaluation, we observe a large variation across domains in Figure 2.
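The strict span-based metric can be sketched in a few lines: a predicted entity counts as correct only if its boundaries, type, and layer all match a gold entity exactly. This is an illustrative reimplementation, not the official GermEval script, and the span tuples below are invented:

```python
def strict_f1(gold, pred):
    """Strict span-based F1 over sets of (layer, start, end, type) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact matches only
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Both entity levels are evaluated jointly; a boundary error on the
# nested entity costs both precision and recall.
gold = {("outer", 1, 4, "ORG"), ("inner", 1, 3, "LOC")}
pred = {("outer", 1, 4, "ORG"), ("inner", 1, 2, "LOC")}  # wrong inner end
print(strict_f1(gold, pred))  # 0.5
```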

Results
Here, we train models on the EWT-NNER training portion of a single web domain, and evaluate the resulting model across all five web domains (in-domain and out-of-domain). The heatmap confirms that training within domain is most beneficial (results on the diagonal), but large drops can be observed across domains. Reviews and Yahoo! answers remain the most challenging, with the lowest F1 scores. Weblogs shows the highest results. We tentatively attribute this to the good coverage of weblogs over all entity classes (see Table 2) and the well-edited style of the text (by inspection, many posts are about politics and military events). If we compare these results to the model trained on all English data in Table 4 (EN all 5), we observe that training on all web training data improves over training on single web domains.
We investigate cross-lingual cross-domain results, to evaluate whether a model trained on English data alone can be improved by further cross-lingual transfer. Table 4 shows that this is the case: there is positive transfer from German and Danish data, with the mBERT model (EN+DA+DE) boosting performance on most domains. The larger XLM-R model helps on specific domains, but is not consistently better than mBERT.
So far we have focused on multilingual contextualized embeddings. The last rows in Table 4 compare the multilingual models to monolingual ones. Interestingly, on these domains a monolingual model does not consistently outperform the multilingual model. While for some domains the EN model is substantially better, this is not the case overall. On average over the 5 web domains, the tri-lingual model with mBERT reaches a slightly higher overall F1 (average of 78.99), followed by both the monolingual BERT model (78.64) and XLM-R (78.63).

Test sets Finally, we run the best model on the test sets and compare to training on English alone. Table 5 confirms the overall trends: there is positive transfer across languages for cross-domain evaluation, with improvements on the majority of domains. The best model reaches an average F1 score of 78.01 on the five web domains. Compared to results within newswire, there is room to improve NNER across domains.
Analysis We perform a qualitative analysis of the best model (EN+DA+DE) on the dev sets.
Detailed scores are in Table 7 in the appendix. The overall F1 of 72% on answers is largely due to low recall on person names (recall 63%) (e.g., peculiar names such as 'Crazy Horse', a Dakota leader) and missed lower-cased product names ('ipod'). On emails, recall on ORG and LOC is low (55% and 65%), as organizations and locations are missed, also due to unconventional spelling. In reviews, the model reaches its lowest F1 on ORG (67%), as it mixes up person names with organizations and lacks recall. Newsgroups is broad (e.g., discussions from astronomy to cat albums), with the lowest per-entity F1 of 75% for MISC. Newsgroups and weblogs are the domains with the most LOCderiv entities, which the model easily identifies (F1 of 93% and 99%, respectively). Overall, weblogs has the highest per-entity F1 scores, all above 75%, with the highest overall F1 on LOC (92 F1, compared to, e.g., 57% on emails). This high result on weblogs can be further attributed to the smaller distance to the training sources (as indicated in the overlap plot in Figure 1) and, to some degree, to using this domain for tuning. From a qualitative look, we note that the weblogs sample is rather clean text, often in reporting style about political events, similar to edited news texts, which we believe is part of the reason for the high performance compared to the other domains in EWT-NNER.

Conclusions
We present EWT-NNER, a nested NER dataset for English web texts, contributing to a limited nested NER resource landscape. We outline the dataset, annotation guidelines and benchmark results. The results show that NNER remains challenging on web texts, and that cross-lingual transfer helps. We hope this dataset encourages research on cross-lingual cross-domain NNER. There are many avenues for future research, including alternative decoding strategies (Yu et al., 2020) and pre-trained model adaptation (Gururangan et al., 2020).

A Data Statement
The following data statement (Bender and Friedman, 2018) documents the origin of the data annotations and the provenance of the original English Web Treebank (EWT) data.
CURATION RATIONALE Annotation of nested named entities (NNE) in web text domains to study the impact of domain gap on cross-lingual transfer.
LANGUAGE VARIETY Mostly US (en-US) mainstream English as target. Transfer from Danish (da-DK) and German (de-DE).
SPEAKER DEMOGRAPHIC Unknown.
ANNOTATOR DEMOGRAPHIC Native languages: Danish, German. Socioeconomic status: higher-education student and university faculty.
SPEECH SITUATION Scripted, spontaneous.
TEXT CHARACTERISTICS Sentences from journalistic edited articles and from social media discussions and postings.
PROVENANCE APPENDIX The data originates from the English Web Treebank (EN-EWT) (Bies et al., 2012; Petrov and McDonald, 2012; Silveira et al., 2014) and the data split available at: https://github.com/UniversalDependencies/UD_English-EWT/

B Additional results
Table 6 provides additional results for both decoding strategies. It shows that single-label decoding is outperformed by the two-headed decoder, confirming similar results on Danish (Plank et al., 2020).

Table 7: Per-entity evaluation of outer-level strict FB1 score (and recall) of the best model EN+DE+DA with mBERT on the dev sets.