IXA/Cogcomp at SemEval-2023 Task 2: Context-enriched Multilingual Named Entity Recognition Using Knowledge Bases

Named Entity Recognition (NER) is a core natural language processing task in which pre-trained language models have shown remarkable performance. However, standard benchmarks like CoNLL 2003 do not address many of the challenges that deployed NER systems face, such as having to classify emerging or complex entities in a fine-grained way. In this paper we present a novel NER cascade approach comprising three steps: first, identifying candidate entities in the input sentence; second, linking each candidate to an existing knowledge base; third, predicting the fine-grained category for each entity candidate. We empirically demonstrate the significance of external knowledge bases in accurately classifying fine-grained and emerging entities. Our system exhibits robust performance in the MultiCoNER2 shared task, even in the low-resource language setting where we leverage knowledge bases of high-resource languages.


Introduction
Research in the Named Entity Recognition field has mainly focused on recognizing entities in news corpora. Datasets like CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) or OntoNotes (Hovy et al., 2006) have been the standard benchmarks for a long time, and it is not difficult to find systems that achieve over 92% F1 score on those datasets. However, it is known that current models perform well on relatively easy entity types (such as person names) but struggle to extract more complex entities (Luken et al., 2018; Hanselowski et al., 2018). Moreover, the evaluation setup usually involves a large overlap between the entities in the train and test sets, overestimating system performance. Removing such overlap resulted in a large performance drop, showing that models tend to memorize rather than generalize to complex or unseen entities (Meng et al., 2021; Fetahu et al., 2022).
Motivated by these insights, recent works have developed new and more challenging datasets like MultiCoNER (Malmasi et al., 2022a) or UFET (Choi et al., 2018) that involve complex entity mentions and a higher granularity of type definitions. To address these new challenges, large language models (LLMs) are commonly used, with the hope that after being trained on billions of tokens the models become aware of the relevant entities. Despite the capacity of the models to memorize information about individual entities, they face a harsh reality: the knowledge a model can memorize is limited by the date on which the model was pre-trained, so new emerging entities can be difficult to categorize for models trained years ago. Moreover, there are cases where the fine-grained category of an entity cannot be identified from the context alone, and thus requires prior knowledge about the entity. For example: George Bernard Shaw, Douglas Fairbanks, Mary Pickford and Margaret Thatcher are some of the famous guests who stayed at the hotel. From this sentence alone, it is impossible to categorize George Bernard Shaw and Margaret Thatcher as politicians, or Douglas Fairbanks and Mary Pickford as artists. We hypothesize that a possible solution to this problem is to allow the models to access up-to-date Knowledge Bases, and to use the retrieved information to infer the correct type of the entity of interest.
To overcome the mentioned challenges, in this work we present a NER approach (see Figure 1) that (1) identifies possible entity candidates by analyzing the input sentence structure, (2) links the candidate to an existing updated knowledge base if possible, and (3) performs the fine-grained classification using the input sentence plus the retrieved information from the KB about the entity. The approach allowed us to perform fine-grained NER with updated world knowledge when possible, and standard NER otherwise. Our code is publicly available to facilitate the reproducibility of the results and its use in future research 1 .

Related Work
Named Entity Recognition (NER) (Grishman and Sundheim, 1996) is a core natural language processing task whose goal is to identify entities belonging to predefined semantic types such as organizations, products and locations. Since its inception, different approaches have been developed, from statistical machine learning methods (Zhou and Su, 2002; Konkol and Konopík, 2013; Agerri et al., 2014) to those based on neural networks (Strubell et al., 2017; Xia et al., 2019) and hybrids of both (Huang et al., 2015; Chen et al., 2017). Recently, the use of richer contextual embeddings computed via Transformer models (Vaswani et al., 2017; Devlin et al., 2018; He et al., 2020) has considerably improved the state of the art of the task. Nevertheless, these models still have problems detecting and labelling unseen or complex entities (Augenstein et al., 2017; Meng et al., 2021).
Many datasets (Malmasi et al., 2022b) and methods (Choi et al., 2018; Zhou et al., 2018; Mengge et al., 2020) have been developed as a result of the difficulty of labelling complex entities. The MultiCoNER (Malmasi et al., 2022b) shared task focuses on detecting semantically ambiguous and complex entities in low-context settings. Almost every participant in the task used Transformer-based models as their backbone, XLM-RoBERTa (Conneau et al., 2020a) being the most popular. The two best models presented for the task (Wang et al., 2022) show that the use of external KBs, such as Wikipedia and Wikidata, significantly improves their results. Meanwhile, other top-performing models (Gan et al., 2022; Pu et al., 2022; Pais, 2022) take advantage of different data augmentation methods without relying on external KBs. The gap between the two best methods was smaller than the gap between the second best and the remaining approaches, implying that exploiting external KBs during training and inference is a promising research line. This difference was greater when more complex entities had to be identified. However, the overall results showed that recognizing complex or infrequent entities is still difficult (Malmasi et al., 2022c).
In this work we focus on the MultiCoNER2 (Fetahu et al., 2023) dataset, a second version of MultiCoNER with higher granularity than its predecessor. Compared to the state of the art of the first MultiCoNER task, our approach also uses XLM-RoBERTa for entity boundary detection and entity classification, but differs in how we retrieve information from external KBs. Instead of retrieving relevant text passages to classify a given entity from a query sentence (Wang et al., 2022), or building a Gazetteer network from Wikidata entities and their derived labels to further enrich token representations, we directly predict Wikidata IDs from entity candidates by integrating mGENRE (De Cao et al., 2022a) as our entity linking module. Previous models have already tried to link entities with external KBs (Tsai et al., 2016; Zhou et al., 2018; Wang et al., 2021). However, mGENRE allows us to easily access both Wikidata entity descriptors and Wikipedia articles to increase our candidate's context for entity classification.

System description
Our method implements three steps. First we identify entity candidates in the input sentence. Then, we link the candidate to an existing knowledge base. Finally, we predict the fine-grained category for each entity candidate.

Entity Boundary Detection
Given unlabelled text as input, we predict named entity boundaries by analyzing the input sentence structure (Figure 2). We cast this as a sequence labelling task in which the model predicts whether a given token is part of an entity ("B-ENTITY", "I-ENTITY", "O"). We use the multilingual XLM-RoBERTa-large model (Conneau et al., 2020b) with a token classification layer (a linear layer) on top of each token representation. Our implementation is based on the sequence labelling implementation of the Huggingface open-source library (Wolf et al., 2019). We evaluate the model on the development set at the end of each epoch and select the best performing checkpoint. We train five independent models and use majority vote as the ensembling strategy at inference time.
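The majority-vote ensembling described above can be sketched as follows; this is a minimal illustration of the strategy, not our exact implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token B/I/O tags from several models by majority vote.

    predictions: one tag sequence per model, all of equal length.
    Ties are broken in favour of the tag seen first, since Counter
    preserves insertion order among equal counts.
    """
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*predictions)]
```

With five models, a token receives the tag that a plurality of the models agree on, which smooths out boundary errors made by any single checkpoint.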

Entity Linking and Information Retrieval
Given a sentence and the boundaries of a named entity within it, we aim to link the entity mention to its corresponding Wikidata/Wikipedia page (Figure 3). To accomplish this task, we employ the mGENRE entity linking system (De Cao et al., 2022a). mGENRE is a sequence-to-sequence system for Multilingual Entity Linking that can generate entity names in over 100 languages from left to right, token-by-token, in an autoregressive manner conditioned on the context. To ensure that only valid entity identifiers are generated, mGENRE employs a prefix tree (trie) to enable constrained beam search. mGENRE predicts both the Wikipedia page title that corresponds to the entity and the language in which the entity mention is written, and subsequently maps both identifiers to the Wikidata ID. In our work, we utilize the pre-trained mGENRE model provided by its authors, who fine-tuned an mBART model pre-trained on 125 languages using Wikipedia hyperlinks in 105 languages. We use constrained beam search to predict the five most probable Wikidata IDs for each entity predicted in the Entity Boundary Detection step.
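The core of mGENRE's constrained decoding is the prefix trie over tokenised entity names: at each generation step, only tokens that extend some valid name are allowed. A minimal, self-contained sketch of this mechanism (with words standing in for subword token IDs) might look as follows:

```python
def build_trie(sequences):
    """Build a nested-dict prefix trie from tokenised entity names."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            # Descend, creating child nodes as needed.
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Return the tokens that may follow `prefix` under the trie constraint.

    An empty list means the prefix cannot be extended into any valid name,
    so the beam search would prune that hypothesis.
    """
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return sorted(node)
```

In the real system, this check is plugged into beam search so that every completed hypothesis is, by construction, a valid entity identifier.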
As illustrated in Figure 4, we leverage the predicted Wikidata ID to obtain the Wikidata description for the entity, as well as the contents of the instance_of and occupation arguments. In the multilingual track, we also retrieve the subclass_of argument; due to time constraints, we were unable to retrieve this argument in the other tracks. As Wikidata pages contain links to their corresponding Wikipedia pages, we also retrieve the Wikipedia summary. These summaries are generally more detailed than the Wikidata descriptions.
We initiate the retrieval process by querying the Wikidata ID with the highest probability, as predicted by mGENRE. If the predicted ID corresponds to a page that has been deleted, is empty, or is a list/disambiguation page, we discard the ID and proceed to the next most probable predicted ID. We found that, while Wikidata descriptions and arguments are mostly available in English for the entities in MultiCoNER, this is not the case for other languages such as Bangla or Farsi. Similarly, a large number of entities have a Wikipedia page available in English but not in other languages. For this reason, we always retrieve the Wikidata description and arguments, as well as the Wikipedia summary, in English.
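The fallback over mGENRE's ranked predictions can be sketched as below; fetch_page and is_usable are hypothetical stand-ins for the actual MediaWiki/Wikidata queries and the page-validity checks described above:

```python
def first_usable_id(ranked_ids, fetch_page, is_usable):
    """Walk mGENRE's ranked Wikidata IDs and return the first usable one.

    ranked_ids: predicted IDs, most probable first.
    fetch_page: callable that retrieves a page for an ID, returning None
        when the page has been deleted or is empty.
    is_usable: callable that rejects list/disambiguation pages.
    """
    for qid in ranked_ids:
        page = fetch_page(qid)
        if page is not None and is_usable(page):
            return qid, page
    # No prediction survived the checks: fall back to context-only classification.
    return None, None
```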

Entity Category Classifier
In our final step, we aim to classify the entity candidates into fine-grained categories by combining all the knowledge we have. We create a text input by concatenating the sentence, in which the entity is annotated with an HTML-style markup, the Wikidata description, a list of all retrieved arguments, and the Wikipedia summary. We use the special token __SEP__ to delimit each piece of information. If we fail to retrieve a Wikidata or Wikipedia page, we use the text "No Wikidata/Wikipedia summary found", and the model predicts the category solely based on the context of the entity.
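A sketch of how such a classifier input could be assembled; the exact markup and placeholder strings of our implementation may differ slightly from this illustration:

```python
def build_classifier_input(marked_sentence, wikidata_description, arguments, wikipedia_summary):
    """Concatenate the marked-up sentence with the retrieved knowledge,
    delimiting each piece with the __SEP__ special token.

    Missing knowledge is replaced by a fixed placeholder so the model
    falls back to the sentence context alone.
    """
    description = wikidata_description or "No Wikidata summary found"
    summary = wikipedia_summary or "No Wikipedia summary found"
    args = ", ".join(arguments)
    return " __SEP__ ".join([marked_sentence, description, args, summary])
```

The resulting string is then fed to the XLM-RoBERTa classifier described below.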
We use the multilingual XLM-RoBERTa-large model (Conneau et al., 2020b) with a text classification layer (a linear layer) on top of the first token. We implement this using the text classification implementation of the Huggingface open-source library (Wolf et al., 2019). During training, we generate the train and development datasets using the gold entity boundaries. During inference, we predict the categories for the entity boundaries computed in the entity boundary detection step. We evaluate the model at the end of each epoch and choose the checkpoint with the highest performance on the development set. We train five independent models and merge the predictions at inference time using majority voting.

Experimental Setup and Dataset
We use the official MultiCoNER2 dataset (Fetahu et al., 2023) in all tracks to train our entity boundary detection and entity classification models. To query external knowledge, we utilize the MediaWiki API to retrieve data from the live version of Wikipedia/Wikidata. MultiCoNER2 consists of two subtasks: the monolingual task, with 12 monolingual datasets, and the multilingual task, which contains data from all languages. For 7 languages (EN, ZH, IT, ES, FR, PT, SV), some sentences were corrupted with noise, either in context tokens or entity tokens. The results of the shared task are evaluated using entity-level macro F1 scores, where all label categories contribute equally to the final score regardless of the number of entities labeled with each category. The dataset consists of 36 pre-defined fine-grained categories which are grouped into 6 coarse-grained categories: Medical entities, Locations, Creative Works, Groups, Persons and Products. A description of the hyperparameters, dataset statistics and hardware used is available in the Appendix.
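Entity-level macro F1 averages per-category F1 scores with equal weight, regardless of how frequent each category is. A minimal reference implementation over sets of gold and predicted entity spans (the tuple layout here is illustrative):

```python
def macro_f1(gold, pred, labels):
    """Entity-level macro F1 over (sentence_id, start, end, label) tuples.

    An entity counts as correct only when both its span and its label
    match exactly; each category's F1 contributes equally to the mean.
    """
    scores = []
    for lab in labels:
        g = {e for e in gold if e[3] == lab}
        p = {e for e in pred if e[3] == lab}
        tp = len(g & p)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because rare categories weigh as much as frequent ones, a system cannot score well by only handling the common entity types.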

Results
Our team participated in all 12 monolingual tracks and the multilingual track. More than 35 teams participated in the shared task, and we achieved the third position in four tracks. The shared task results computed by the organizers are shown in Table 1. We first compare our system with an XLM-RoBERTa-large (Conneau et al., 2020b) baseline. For this baseline, we added a token classification layer (a linear layer) on top of each token representation of an XLM-RoBERTa-large model and fine-tuned the model to directly predict the fine-grained category for each token. The results were much lower than those achieved with our system, which confirms our hypothesis that external knowledge is needed to predict the entity categories. We also compare our system with the performance of the systems from other teams that achieved the first position in any of the tracks. Our system was most competitive in Hindi and Bangla, as well as in the multilingual track. Hindi and Bangla are by far the languages with the smallest Wikipedias, ranked 59th and 63rd by number of articles, respectively. 2 We attribute this to our decision to retrieve the external knowledge from the English Wikipedia/Wikidata for all the tracks. Because we predict a Wikidata ID, we are able to retrieve knowledge in any language in which it is available. This enables our system to be used for low-resource languages for which no large, up-to-date knowledge base is available, by linking the entities found to a knowledge base in a high-resource language. On the other hand, our system is not as competitive in very high-resource languages, such as English and Chinese.

Clean vs Noisy Performance
In seven languages (EN, ZH, IT, ES, FR, PT, SV), some sentences contained noise, either in context tokens or entity tokens. Table 2 shows the results of our system on the sets of uncorrupted and corrupted sentences. We observed a significant drop in performance when noise was introduced in the sentences, especially in the Chinese track. We believe that this is because our entity linking step depends too much on the entity mention itself and does not sufficiently take into account the context in which the entity appears. Improving our system's ability to consider contextual information is something we need to work on.

2 https://meta.wikimedia.org/wiki/List_of_Wikipedias

Ablation Study
In this section, we evaluate the different steps in our system and check how different design decisions affect performance.
How important is external knowledge to classify entities? We aim to measure the importance of retrieving external knowledge for accurate entity classification in the MultiCoNER dataset. To achieve this, we train our text classification system on the training split and evaluate its performance on the development data, using the gold labeled entity spans for both splits. We assess the relevance of external knowledge by varying the amount of external knowledge used in the evaluation. Table 3 shows the results obtained using different combinations of external knowledge, including using only the entity in its context (no external knowledge). As we observed in Table 1, the results without external knowledge are very poor. The most relevant external knowledge is the content of the Wikidata arguments, and as we provide more external knowledge to the model, the performance improves further. These results also demonstrate that our system achieves remarkable results in classifying entities into fine-grained categories when the correct entity spans are provided, indicating that the entity boundary detection step hinders the performance of our system. The entity classification confusion matrix for the English development set is available in Appendix D.

Table 4: F1-score of the Entity Boundary Detection step in the development split
Entity boundary detection performance Our initial hypothesis was that external knowledge is required to accurately classify entities into fine-grained categories. However, we also hypothesized that detecting the boundaries of named entities does not require external knowledge. For instance, it is possible to identify that a sequence of words corresponds to the name of a person without knowing anything about that person. Table 4 shows the performance of the entity boundary detection step on the development dataset, which does not utilize any external knowledge. We were able to identify a significant proportion of the entities in the dataset. However, the results were lower than those of the entity classification step in Table 3. This indicates that the entity boundary detection step is the weakest point in our system. We believe that incorporating external knowledge into this step would enhance the results.

Conclusion
We have developed a system that identifies potential entity candidates and leverages external knowledge from an up-to-date knowledge base to classify them into a set of predefined fine-grained categories, addressing the challenges posed by temporal knowledge and unknown entities. Our results demonstrate exceptional performance in classifying entities into fine-grained categories, underscoring the need for external knowledge to accurately classify entities in the MultiCoNER dataset. However, our current entity boundary detection step does not incorporate external knowledge, which we plan to improve in the future. Furthermore, our system exhibits promising results for low-resource languages where a comprehensive Wikipedia is unavailable: by linking the identified entities to a knowledge base in a high-resource language, it can be used to process low-resource languages. In future work, we intend to integrate all the steps and insights gained from our current system into a single end-to-end model. Although each step performs well when evaluated independently, the current pipeline compounds errors at each step, which we aim to address in the integrated model.

Acknowledgements
This work has been partially supported by the HiTZ center and the Basque Government (Research group funding IT-1805-22). We also acknowledge the funding from the following projects: Iker García-Ferrero, Oscar Sainz and Ander Salaberria are supported by doctoral grants from the Basque Government (PRE_2022_2_0208, RE_2022_2_0110 and PRE_2022_2_0219, respectively). Jon Ander Campos enjoys a doctoral grant from the Spanish MECD (FPU18/01271).

B Hyperparameters

B.1 Entity Boundary Detection
We use the XLM-RoBERTa-large model in all our experiments. We train the model for 8 epochs with a batch size of 16 and the AdamW optimizer (Loshchilov and Hutter, 2019) with a 2e-5 learning rate. We use a maximum sequence size of 256 tokens; longer sequences are truncated to the first 256 tokens. The rest of the hyperparameters are the default values of the sequence labelling implementation of the Huggingface library 3 . As described in Section 3.1, we evaluate the model on the development set at the end of each epoch and select the best performing checkpoint. We train five independent models and use majority vote as the ensembling strategy at inference time.
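For reference, the hyperparameters above collected into a single configuration dictionary (the variable and key names are ours):

```python
# Boundary-detection hyperparameters as stated above; all remaining
# settings are the Huggingface sequence labelling defaults.
BOUNDARY_DETECTION_HPARAMS = {
    "model": "xlm-roberta-large",
    "epochs": 8,
    "batch_size": 16,
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "max_seq_length": 256,
    "ensemble_size": 5,
}
```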

B.2 Entity Linking and Information Retrieval
We use the mGENRE implementation 4 in the fairseq library (Ott et al., 2019). We use the prefix tree provided by the authors together with marginalization, which constrains the model to only generate valid identifiers. We retrieve data from Wikipedia using the pymediawiki 5 wrapper and parser for the MediaWiki API, and the Wikidata client library for Python 6 to retrieve knowledge from Wikidata.

B.3 Entity Category Classifier
We use the XLM-RoBERTa-large model in all our experiments. We train the model for 8 epochs with a batch size of 16 and the AdamW optimizer (Loshchilov and Hutter, 2019) with a 2e-5 learning rate. We use a maximum sequence size of 256 tokens. The rest of the hyperparameters are the default values of the text classification implementation of the Huggingface library 7 . As described in Section 3.3, we evaluate the model on the development set at the end of each epoch and select the best performing checkpoint. We train five independent models and use majority vote as the ensembling strategy at inference time.

C Hardware used
We perform all our experiments using a single NVIDIA A100 GPU with 80GB of memory, although our system can also run on GPUs with 24GB of VRAM. The machine used has two AMD EPYC 7513 32-core processors and 512GB of RAM.