MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)

Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd.


Introduction
Named Entity Recognition (NER) represents a milestone in information extraction, and its aim is to identify and classify key information in unstructured texts, i.e. named entities (Nadeau and Sekine, 2007). It is widely used in a broad spectrum of downstream applications, like machine translation (Babych and Hartley, 2003), question answering (Mollá et al., 2006), automatic text summarization (Aone et al., 1998), and entity linking (Martins et al., 2019), inter alia.
With the advent of pretrained language models like BERT (Devlin et al., 2019) or LUKE (Yamada et al., 2020), the latter with a particular focus on named entities, the NER field observed astonishing results on conventional benchmarks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). However, such benchmarks are limited in size, cover a single textual genre, and are available only for a narrow set of languages. Moreover, they focus on coarse-grained entity types, and often overlook more complex entities like titles of books, songs and movies. These latter entities are not simple nouns and can be both syntactically and semantically ambiguous. Specifically, they can assume the form of any linguistic constituent (e.g. Singin' in the Rain), which makes them difficult to extract. Indeed, in the last decade, the OntoNotes 5.0 dataset (Weischedel et al., 2013) has become very popular thanks to its high quality, size and fine-grained categories. Nevertheless, it covers only 3 languages, namely, English, Arabic and Chinese.
Since the manual creation of training data for NER is expensive and time-consuming, especially when many languages have to be covered, several studies have tried to address data scarcity by producing training data automatically (Nothman et al., 2013; Al-Rfou et al., 2015; Tsai et al., 2016; Pan et al., 2017), recently showing that automatically-generated annotations can boast a quality comparable to that of manually-created ones (Tedeschi et al., 2021b). Unfortunately, although these studies have considered a wider range of languages, they have still focused on coarse-grained entities and on a single textual genre, i.e. encyclopedic texts from Wikipedia 1 .
In this paper, inspired by the success of the OntoNotes 5.0 dataset and by recent achievements in automatic data creation, we fill the aforementioned gaps and propose the following novel contributions: 1. We design a new language-agnostic methodology for automatically generating high-quality and fine-grained NER annotations by exploiting the texts of Wikipedia and Wikinews 2 ; 2. We introduce a novel automatically-created benchmark for NER that covers 10 languages, 15 entity types and 2 textual genres, together with a small manually-curated test set for the English language; 3. We extensively evaluate the quality of the data produced on both our manually-annotated test set and standard benchmarks for NER.
Additionally, although in this work we focus on NER, we also contribute to the entity disambiguation (also known as entity linking) task, i.e. the task of linking entities mentioned in texts with their corresponding entry in a knowledge base. Specifically, for a given entity, we provide disambiguation information together with its NER tag in order to enable training, validation and testing of multilingual entity linking models. Finally, we also include image URLs to encourage the creation of multimodal systems. To enable comparability on our benchmark, we release our data and software at https: //github.com/Babelscape/multinerd.

Gold-Standard Data
High-quality annotations are essential for both learning and evaluation of NER systems. Indeed, in the past few decades a large number of NER datasets have been proposed. Initially, the MUC-6 and MUC-7 shared tasks focused on entity names (i.e. persons, locations and organizations), temporal expressions (i.e. dates and times) and number expressions (i.e., currency values and percentages), but only English newswire articles were considered (Grishman and Sundheim, 1996;Chinchor and Robinson, 1997).
A few years later, different datasets were derived from Reuters News for the CoNLL-2002 and CoNLL-2003 shared tasks on language-independent Named Entity Recognition (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), covering four different languages (i.e. Dutch, English, German and Spanish). However, these datasets were limited in size and only four coarse-grained entity types were considered: Person, Location, Organization and Miscellaneous 3 . Nonetheless, these datasets are still widely used to benchmark NER systems. Balasuriya et al. (2009) claimed that NER was needed in many domains beyond newswire texts, and introduced WikiGold, a manually-annotated dataset derived from Wikipedia articles. Even so, WikiGold covered coarse-grained entities, was limited in size and considered only the English language. Following the same motivation as Balasuriya et al. (2009), Ritter et al. (2011) introduced a dataset for English tweets using 10 NER classes. Another considerable step forward was made by Weischedel et al. (2013), who introduced OntoNotes 5.0. This dataset covered 18 fine-grained classes, multiple genres (e.g. newswire and weblogs), and multiple languages (English, Chinese, and Arabic). Thanks to its high quality, it is one of the most widely used datasets for NER.
Finally, another notable dataset was proposed for the WNUT 2017 shared task on emerging and rare entities, covering different textual genres (tweets, YouTube comments, Reddit and StackExchange posts) (Derczynski et al., 2017). However, only the English language and 6 categories were considered.

Silver-Standard Data
Although the OntoNotes 5.0 dataset constitutes a valuable resource for training and evaluating multilingual and fine-grained NER systems, its applicability is limited to the three languages it covers. Indeed, in the last decade, with the aim of scaling NER to a wider set of languages, more interest has been devoted to automatic data creation.
The first successful attempt in this direction was made by Nothman et al. (2013) who produced the WikiNER dataset. They proposed a strategy for automatically creating multilingual training data for NER by exploiting the texts of Wikipedia and its hypertext organization. In addition, they also used redirect-base heuristics to infer more namedentity mentions. By applying this methodology, they covered 9 languages, but they still focused on the standard coarse-grained entity types.
Adopting a similar strategy, Pan et al. (2017) introduced WikiANN, a language-independent framework for extracting entities from documents. Their procedure was made up of two main steps: i) classify entries in the English Wikipedia into specific entity types, and ii) propagate the annotations to other languages by applying cross-lingual transfer. This procedure yielded massive corpora consisting of 282 languages, but with lower annotation quality and a focus on persons, locations, organizations and geo-political entities.
Finally, Tedeschi et al. (2021b) proposed WikiNEuRal, an annotation pipeline that effectively combined recent pretrained language models with knowledge-based approaches, and produced high-quality annotations for NER in 9 languages by exploiting Wikipedia texts. Surprisingly, the authors showed that their methodology, in 2 out of 3 settings, produced annotations with a quality even higher than that of manual ones. However, again, only coarse-grained entities were considered.

Although automatic methods achieved high annotation quality and covered many languages, all of them focused on coarse-grained entities and on a single textual source: Wikipedia. On the other hand, gold-standard datasets focused mainly on the English language. Additionally, none of them included disambiguation information. Evidently, a unified effort to obtain a large-scale multilingual, multi-genre and fine-grained resource for Named Entity Recognition and Disambiguation is still missing.

NER Classes
Our 15 NER classes are a subset of the newly introduced 18 classes of Tedeschi et al. (2021a), designed to reduce the intrinsic sparsity of the Entity Linking task; the full list is provided in Table 1. We prefer these classes to the OntoNotes ones because they cover a wider range of macro categories. For instance, the OntoNotes PRODUCT class, which groups very heterogeneous entities, is split into FOOD, VEHI and INST. Over and beyond these, our new set contains animals, plants, biological entities, celestial bodies, diseases and mythological entities that are not present in OntoNotes.

MultiNERD
In this Section we describe our language-agnostic strategy for automatically generating a fine-grained and multilingual resource to train robust NER and ED systems. Specifically, our methodology widely extends previous state-of-the-art strategies, and it is characterized by the following five steps: i) preprocessing of Wikipedia and Wikinews articles (Section 4.1), ii) identification of entities (Section 4.2), iii) tagging the identified entities with the NER labels (Section 4.3), iv) propagation of the annotations (Section 4.3), v) enhancement of the annotations (Section 4.4).

Wikitext Preprocessing
Wikipedia and Wikinews articles provide plenty of manually-curated information that can be exploited for the automatic annotation of sentences, i.e. Wikilinks 4 . However, in addition to Wikilinks, articles may contain elements (e.g. images, tables, formulas and lists) and sections (e.g. see also, references, further readings) that do not correspond to well structured text; therefore, we remove them with the intent of reducing noise. This step converts articles to plain texts containing only Wikilinks.
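As a rough illustration of this cleaning step (a minimal sketch, not the authors' actual pipeline, whose tooling is not specified), the idea is to strip templates, tables, file links and non-content sections while keeping plain text and Wikilinks:

```python
import re

def strip_wikitext(text: str) -> str:
    """Toy wikitext cleaner: drop templates, tables, file links and
    trailing reference-style sections, keeping text and [[wikilinks]]."""
    # Remove non-nested templates {{...}} and tables {|...|}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.S)
    # Remove image/file links
    text = re.sub(r"\[\[(?:File|Image):[^\]]*\]\]", "", text)
    # Truncate at non-content sections such as "See also" or "References"
    text = re.split(r"==\s*(?:See also|References|Further reading)\s*==",
                    text)[0]
    return text.strip()

article = (
    "{{Infobox company}}[[Elon Musk]] founded [[SpaceX]] in 2002.\n"
    "==References==\n* some citation"
)
print(strip_wikitext(article))
```

A production pipeline would need to handle nested templates and the full wikitext grammar; this sketch only conveys the shape of the step.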

Entity or Concept?
Wikilinks provide potential entity mentions. Indeed, some of them correspond to entities (e.g. Elon Musk), and others correspond to concepts (e.g. Table). In order to distinguish between them, we take advantage of the one-to-one linkage between Wikipedia and BabelNet 5 (Navigli and Ponzetto, 2012;Navigli et al., 2021) and exploit the concept-vs.-entity categorization provided therein. Although it is evident that for some categories we are only interested in entities in the strictest sense (e.g. PER, ORG and LOC), we need to relax this constraint for other classes (e.g. ANIM, PLANT, FOOD and DIS). Thus, in order to extract animals (e.g. Labrador Retriever), plants (e.g. Pinus), food (e.g. Carbonara) and diseases (e.g. Alzheimer's disease), among others, we also need to consider elements that are labeled as concepts in BabelNet. This step tells us which Wikilinks have to be annotated with an entity type, and which of them have to be discarded. The full list of design choices is provided in Appendix A.
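The filtering rule above can be sketched as follows, using a hypothetical lookup table standing in for BabelNet's concept-vs.-entity categorization (the class names follow Section 3; the actual design choices are in Appendix A):

```python
# Hypothetical stand-in for BabelNet's concept-vs.-entity categorization:
# title -> (kind, NER class or None if no class applies).
KIND = {
    "Elon Musk": ("entity", "PER"),
    "Table": ("concept", None),
    "Labrador Retriever": ("concept", "ANIM"),
    "Carbonara": ("concept", "FOOD"),
}

# Classes for which concepts (not only entities in the strict sense)
# are admitted, per the relaxation described above.
CONCEPT_FRIENDLY = {"ANIM", "PLANT", "FOOD", "DIS"}

def keep_wikilink(title: str) -> bool:
    """Decide whether a Wikilink should receive an entity-type annotation."""
    kind, ner = KIND.get(title, ("concept", None))
    if ner is None:
        return False  # no NER class applies: discard
    return kind == "entity" or ner in CONCEPT_FRIENDLY
```

Under this rule, "Elon Musk" (a strict entity) and "Carbonara" (a concept in a concept-friendly class) are kept, while "Table" is discarded.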

Tagging Wikipedia and Wikinews Articles
Semantic Classifier We now aim at providing each (remaining) Wikilink in a Wikipedia (or Wikinews) article with a category c ∈ C, where C is the set of the NER classes introduced in Section 3.
To do this, we introduce a Semantic Classifier that exploits the one-to-one correspondence between Wikipedia pages and BabelNet synsets. We start by manually annotating 300 synsets to cover as many high-order concepts of the WordNet 6 (Miller, 1995) nominal taxonomy (a subset of the BabelNet taxonomy) as possible. For instance, we label the following high-level synsets as follows:
• animal (bn:00004222n) → ANIM;
• company (bn:00021286n) → ORG;
• town (bn:00077773n) → LOC.
Then, to propagate these annotations to all other synsets in WordNet, we descend through its taxonomy by following hyponymy and has-instance relationships (i.e. parent-to-child relations). For example, all the children of animal (bn:00004222n), e.g. dog (bn:00015267n), inherit the ANIM tag. This step yields 40k high-quality annotated synsets. At this point, to annotate a Wikilink l in a Wikipedia (or Wikinews) article w, we retrieve its corresponding synset s from BabelNet, and we follow hypernymy relations (child-to-parent relations) until one or more of the 40k synsets in the expanded set is reached 7 . Here, we distinguish between two possible cases: 1. When a single ancestor is reached, or when all the ancestors share the same class, the corresponding annotation is simply inherited. For instance, starting from Apple Inc. (bn:03739345n) and climbing the taxonomy, we find only company (bn:00021286n) at distance 1, hence Apple Inc. inherits the ORG annotation.
2. When two or more ancestors have discordant annotations, then the highest-scoring class 8 is assigned. Formally, for each NER class c ∈ C, the score is computed as follows:

score(c) = Σ_{a ∈ A_c} 1/d(a)

where A_c is the set of all the ancestors of synset s with tag c, and d(a) is a function that returns the distance of a from s in the BabelNet taxonomy. As an example, consider the synset Bill Gates (bn:00010401n).
When climbing the taxonomy, at distance 2, the hominid (bn:00044571n) synset is reached, which is classified as ANIM. However, the Bill Gates synset is also a child of human (bn:00044576n), computer scientist (bn:00021495n) and magnate (bn:00008639n), resulting in the highest score for the PER class.
This procedure allows us to label each Wikilink in a given article with a NER class.
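The taxonomy climb and the inverse-distance vote can be sketched together as follows. This is a minimal reading of the procedure, not the authors' implementation: the synset names, the parent links and the seed tags below are all illustrative stand-ins for BabelNet ids and for the 40k-synset expanded seed set.

```python
from collections import defaultdict

# Toy fragment of the BabelNet taxonomy (child -> parents); names stand
# in for synset ids. "primate" is an untagged intermediate node, so that
# hominid (ANIM) is only reached at distance 2, as in the paper's example.
PARENTS = {
    "bill_gates": ["human", "computer_scientist", "magnate", "primate"],
    "primate": ["hominid"],
}
# Hypothetical tags from the expanded seed set.
SEED_TAGS = {
    "human": "PER", "computer_scientist": "PER",
    "magnate": "PER", "hominid": "ANIM",
}

def classify(synset):
    """Climb hypernymy links breadth-first; every tagged ancestor votes
    for its class with weight 1/d(a), i.e. score(c) = sum_{a in A_c} 1/d(a)."""
    scores, frontier, seen, dist = defaultdict(float), [synset], {synset}, 0
    while frontier:
        dist += 1
        nxt = []
        for node in frontier:
            for parent in PARENTS.get(node, []):
                if parent in seen:
                    continue
                seen.add(parent)
                if parent in SEED_TAGS:
                    scores[SEED_TAGS[parent]] += 1.0 / dist
                else:
                    nxt.append(parent)  # keep climbing past untagged nodes
        frontier = nxt
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)

tag, scores = classify("bill_gates")
print(tag, scores)  # PER wins: 3.0 vs 0.5 for ANIM
```

With three PER ancestors at distance 1 and one ANIM ancestor at distance 2, PER scores 3.0 against 0.5 for ANIM, mirroring the Bill Gates example above.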
Tag Propagation Wikipedia and Wikinews guidelines specify that only the first mention of a certain article has to be linked. This implies that tagging only Wikilinks leads to sparse annotations. To cope with this issue, we employ a simple yet effective exact-match heuristic: for each Wikilink l with an associated class c, we assign the class c to all the expressions e_i in the same document as l such that e_i = l ∨ e_i ∈ syn(l), where syn(l) is a function that returns the synonyms of l from BabelNet. Finally, the annotations are converted to BIO format 9 .
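The exact-match propagation and BIO conversion can be sketched as below. In this simplified version, synonyms from syn(l) are assumed to have been folded into the mention dictionary beforehand; longest mentions are matched first so that "Bill Gates" wins over "Gates".

```python
def to_bio(tokens, mention_tags):
    """Greedy longest-match propagation of Wikilink tags to every exact
    occurrence in the document, emitted in BIO format."""
    tags = ["O"] * len(tokens)
    # Try longer mentions first so multiword spans take precedence.
    mentions = sorted(mention_tags, key=lambda m: -len(m.split()))
    i = 0
    while i < len(tokens):
        for m in mentions:
            words = m.split()
            if tokens[i:i + len(words)] == words:
                tags[i] = "B-" + mention_tags[m]
                for j in range(i + 1, i + len(words)):
                    tags[j] = "I-" + mention_tags[m]
                i += len(words) - 1
                break
        i += 1
    return tags

tokens = "Bill Gates founded Microsoft and Gates still advises it".split()
# Tags gathered from Wikilinks, with BabelNet synonyms already merged in.
mention_tags = {"Bill Gates": "PER", "Gates": "PER", "Microsoft": "ORG"}
print(to_bio(tokens, mention_tags))
```

Note how the second, unlinked mention "Gates" receives the PER tag even though only the first mention carried a Wikilink.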
The above-described methodology allows us to have at the same time, for a given entity, both the NER annotation and the BabelNet synset. Then, as already mentioned, through BabelNet we can easily access other resources, and retrieve the corresponding Wikipedia and Wikidata pages. Hence, in our dataset we include disambiguation information from the three above-mentioned knowledge bases. Additionally, for a given entity, we also include the corresponding definition and the main image from Wikidata, where this latter can be used to develop multimodal NER and entity linking systems. An instance of our dataset is provided in Table 2.

9 The BIO tagging scheme (short for Beginning, Inside, Outside) is a popular format for handling spans of tokens.

Annotation Enhancement
The above steps enable multilingual and fine-grained annotations to be created. However, these annotations are derived automatically and, therefore, they may contain errors. Tedeschi et al. (2021b) improved the quality of the annotations by combining them with the predictions of a Transformer-based neural classifier (mBERT + Bi-LSTM + CRF, Mueller et al., 2020). Unfortunately, this strategy requires pre-existing annotated data in the same set of languages and with the same NER tags in order to train the NER classifier, and these are not available in our case. To cope with this issue, we employ the same Transformer-based architecture but drop the requirement of pre-existing training data by introducing a general, straightforward iterative strategy to jointly improve both the performance of the neural model and the quality of the data produced. Algorithm 1 illustrates the procedure. Essentially, it starts by taking a set A of n articles from W (line 1) and annotating them with the steps described above. In the annotate(A, M_D) function, the neural model M_D is used to refine the annotations and reduce noise. Specifically, if the NER class of an entity predicted by the neural model is different from the one assigned by the knowledge-based approach (Section 4.3), the corresponding sentence is discarded.
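The self-improvement loop can be sketched as follows. This is our reading of Algorithm 1 rather than the authors' code: each round annotates a fresh batch of n articles, filters out sentences where the current model disagrees with the knowledge-based tag, and retrains on the growing dataset. The stubs at the bottom are toy stand-ins used only to make the loop runnable.

```python
def iterative_refine(articles, n, t, annotate, train):
    """Iterative annotate-filter-retrain loop (sketch of Algorithm 1):
    dataset D grows by one batch per round; the model trained on D so far
    is used to filter the next batch's knowledge-based annotations."""
    dataset, model = [], None
    for step in range(t):
        batch = articles[step * n:(step + 1) * n]
        dataset += annotate(batch, model)  # model is None on the first pass
        model = train(dataset)
    return dataset, model

# Toy stand-ins: "annotate" keeps items the current model agrees on,
# "train" returns a model that accepts even numbers only.
def annotate(batch, model):
    return [x for x in batch if model is None or model(x)]

def train(dataset):
    return lambda x: x % 2 == 0

data, model = iterative_refine(list(range(8)), n=4, t=2,
                               annotate=annotate, train=train)
print(data)  # first batch kept whole, second batch filtered to [4, 6]
```

In the paper's instantiation, annotate() is the knowledge-based pipeline of Sections 4.1-4.3 plus the agreement filter, and train() fits the mBERT + Bi-LSTM + CRF model.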

Experiments
In this Section, we describe our experimental setup (Section 5.1), the datasets 10 we use to train (Section 5.2) and evaluate (Section 5.3) our methodology, and finally the results obtained (Section 5.4).

Experimental Setup
In our experiments, we evaluate the quality of our data in two different settings: 1. In order to compare our dataset against previous state-of-the-art automatically-created datasets, we map our fine-grained annotations to the coarse-grained classes used by these datasets. Then, we train the mBERT + Bi-LSTM + CRF model introduced in Section 4.4 on both our dataset and the other above-mentioned datasets. Finally, we compare the performance of the corresponding systems on gold-standard benchmarks for NER. Appendix A provides class mapping details; 2. To measure the quality of our fine-grained annotations, we manually annotate a random sample of 1K English sentences, and compare the annotations produced by our methodology with the corresponding ground truths.

Table 4: Span-based micro F 1 scores obtained by training a reference NER system on different automatically-created datasets (i.e. WikiANN, WikiNER, WikiNEuRal, MultiNERD) and testing on common NER benchmarks.
We implement our model with PyTorch using the Transformers library (Wolf et al., 2019) to load the weights of BERT-base-multilingual-cased (mBERT), and train each model configuration with an early-stopping strategy using a patience value of 5. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 10 −3 and a cross-entropy loss criterion. We repeat each training run on 10 different seeds, fixed across experiments, and report the mean of their span F 1 scores computed with the official conlleval script. Further details are provided in Appendix B.
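The early-stopping criterion with patience 5 can be sketched as below. The exact monitored metric and comparison rule are not specified in the paper, so this assumes the common convention of halting when validation F1 has not exceeded its previous best for `patience` consecutive evaluations:

```python
def should_stop(val_f1_history, patience=5):
    """Assumed early-stopping rule: stop once the best validation F1 has
    not improved over the last `patience` evaluations."""
    if len(val_f1_history) <= patience:
        return False  # not enough history to judge
    best_before = max(val_f1_history[:-patience])
    return max(val_f1_history[-patience:]) <= best_before

# Plateaued run: last five scores never beat the earlier best of 0.71.
print(should_stop([0.62, 0.70, 0.71, 0.71, 0.71, 0.71, 0.70, 0.71]))
# Still improving: training continues.
print(should_stop([0.50, 0.60, 0.65, 0.70, 0.72, 0.74, 0.75]))
```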

Training Data
We train our reference model with four different silver-standard datasets:
• MultiNERD: the resource created using the steps described in Section 4 from Wikipedia and Wikinews articles 11 , with n = 30K and t = 8. It covers 10 languages: Chinese, Dutch, English, French, German, Italian, Polish, Portuguese, Russian and Spanish. Statistics are shown in Table 3.
• WikiNEuRal (Tedeschi et al., 2021b): the current best-performing approach for NER silver data creation. It covers 9 languages (i.e. Dutch, English, French, German, Italian, Polish, Portuguese, Russian and Spanish), and sentences are extracted from Wikipedia.
• WikiNER (Nothman et al., 2013): a high-quality automatically-derived dataset for NER from Wikipedia. It covers the same languages as WikiNEuRal.

11 We use the April 2021 snapshot for both Wikipedia and Wikinews dumps, sampling random articles.
• WikiANN 12 (Pan et al., 2017): a massive dataset for NER consisting of Wikipedia documents annotated in 282 languages.
All datasets are tagged with the four standard entity types (PER, ORG, LOC, MISC), except for WikiANN, which does not contain the MISC label. Accordingly, when evaluating on WikiANN, only the PER, ORG and LOC classes are considered.
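The fine-to-coarse mapping used for this comparison can be sketched as follows. The actual mapping is given in Appendix A; the table below shows only a few illustrative entries, and the `keep_misc` switch models the WikiANN-style evaluation without the MISC label:

```python
# Hypothetical fragment of the fine-to-coarse class mapping
# (the authors' full mapping is in Appendix A).
TO_COARSE = {
    "PER": "PER", "ORG": "ORG", "LOC": "LOC",
    "ANIM": "MISC", "MEDIA": "MISC", "FOOD": "MISC",
}

def coarsen(tag, keep_misc=True):
    """Map a fine-grained BIO tag to its coarse CoNLL-style counterpart.
    With keep_misc=False (WikiANN-style evaluation) MISC spans become O."""
    if tag == "O":
        return "O"
    prefix, cls = tag.split("-")
    coarse = TO_COARSE[cls]
    if coarse == "MISC" and not keep_misc:
        return "O"
    return f"{prefix}-{coarse}"

print(coarsen("B-ANIM"))                   # B-MISC
print(coarsen("B-ANIM", keep_misc=False))  # O
```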

Test Data
Common Benchmarks For our first setting (Section 5.1), we use 5 common gold-standard test sets:
• CoNLL-2002 and CoNLL-2003 (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003): a well-known corpus of NER-annotated newswire articles for Dutch, English, German and Spanish;
• WikiGold (Balasuriya et al., 2009): a set of human-annotated English Wikipedia articles.
• OntoNotes 5.0 (Pradhan et al., 2012): a popular dataset for NER that includes texts from different textual genres and multiple domains.
• BSNLP-2017 (Piskorski et al., 2017): a shared-task benchmark for NER in Slavic languages.

Manual Annotation For our second experimental setting (Section 5.1), due to the absence of NER benchmarks that use our set of categories (Section 3), we conduct a manual evaluation to assess the quality of our dataset. Specifically, we randomly select 13 a sample of 1K English sentences, pre-annotated with the NER tags produced using our methodology (Section 4), and confirm or replace the annotations associated with each token in the dataset. The resulting gold-standard dataset is used to analyze the quality of our silver-standard data.

Results
Coarse-Grained Evaluation In our first setting (Section 5.1), we measure the effectiveness of our methodology by comparing the quality of the data produced against that of other datasets created using previous state-of-the-art strategies for NER silver-data creation (i.e. the datasets listed in Section 5.2). Since past approaches focused on coarse-grained entities, we can compare the quality only for such entity types. The results obtained are shown in Table 4. Although our dataset covers a wider range of categories than its competitors, it nevertheless outperforms all of them on all tested datasets and languages. We attribute this advancement mainly to the self-improvement algorithm introduced in Section 4.4, which iteratively refines the annotations using a better model at each iteration. To demonstrate the impact of our algorithm, we construct baseline versions of MULTINERD for DE, EN, ES, NL, PL and RU with the same sizes as the corresponding refined versions, but without using our enhancement procedure. As can be observed from Table 4, the refined versions provide an average improvement of almost 7 F 1 points. In addition, the wider number of textual genres covered by MULTINERD leads to more robust systems.

13 We ensure that the dataset contains a sufficient number n of instances for each NER class. We set n = 20. Statistics are provided in the "Support" column of Table 5.
Fine-Grained Evaluation Although the coarse-grained evaluation conducted in the previous Section demonstrated that our MULTINERD methodology creates high-quality annotations, independently of the language, it is not sufficient to understand how our annotation pipeline performs on fine-grained classes. Indeed, to measure this, we use a sample of 1K English sentences manually annotated with fine-grained entities, as explained in Section 5.3, and report the results in Table 5. As expected, the PER, ORG and LOC classes are among the best-performing classes. Similarly, celestial bodies, diseases, events and media also have very high performance, thanks to their occurrences being almost always linked in Wikipedia and Wikinews articles (high recall) and easily distinguishable (high precision). In contrast, animals, biological entities, foods and plants have a high degree of confusion (lower precision), and are very often not linked (low recall). To better explain this, we report in Figure 1 the confusion matrix of the silver-standard annotations produced by our approach compared to the gold-standard ones. As an example, it can be observed that animals and plants are often confused with each other, mainly because their scientific names are morphologically very close. Similarly, animals and plants are also confused with foods (e.g. Alaskan salmon and Quinoa), and vice versa.
Even though the quality of the annotations produced by our approach for any particular one of the 10 languages covered herein is strongly dependent on the quality of the corresponding Wikipedia and Wikinews dumps, we expect comparable performance on all other languages, as suggested by the statistics in Table 3 which show strong consistency across languages.

Conclusions
In this work we introduced MULTINERD, a novel resource for training robust multilingual and fine-grained Named Entity Recognition (and Disambiguation) systems. To create it, we presented a new language-agnostic strategy for generating high-quality silver-standard NER and ED annotations. This strategy uses a knowledge-based semantic classifier to automatically annotate Wikipedia and Wikinews articles, and then iteratively enhances the annotations produced by means of a self-improvement algorithm which builds upon neural models. Our experiments showed that MULTINERD outperformed previous state-of-the-art data-production methods across all tested languages and domains, while covering a much wider set of NER categories. Additionally, we also included image URLs in our dataset to encourage the development of multimodal NER and ED systems. This visual information could also be exploited to further improve the quality of the annotations by ensembling the predictions of NLP and Computer Vision models. We release our MULTINERD dataset and software at https://github.com/Babelscape/multinerd.