DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition

The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios, inheriting the semantic ambiguity and low-context setting of the MultiCoNER I task. To cope with these problems, the previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers. However, they still suffer from insufficient knowledge, limited context length, and a single retrieval strategy. In this paper, our team DAMO-NLP proposes a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER. We perform error analysis on the previous top systems and reveal that their performance bottleneck lies in insufficient knowledge. We also discover that the limited context length causes the retrieved knowledge to be invisible to the model. To enhance the retrieval context, we incorporate the entity-centric Wikidata knowledge base and use an infusion approach to broaden the contextual scope of the model. We also explore various search strategies and refine the quality of the retrieved knowledge. Our system wins 9 out of 13 tracks in the MultiCoNER II shared task. Additionally, we compare our system with ChatGPT, one of the large language models that have unlocked strong capabilities on many tasks. The results show that there is still much room for improvement for ChatGPT on the extraction task.


Introduction
The MultiCoNER series of shared tasks (Malmasi et al., 2022b; Fetahu et al., 2023b) aims to identify complex named entities (NE), such as titles of creative works, which do not possess the traditional characteristics of named entities such as persons, locations, etc. It is challenging to identify these ambiguous complex entities based on short contexts (Ashwini and Choi, 2014; Meng et al., 2021; Fetahu et al., 2022). The MultiCoNER I task (Malmasi et al., 2022b) focuses on the problem of semantic ambiguity and low context in multilingual named entity recognition (NER). In addition, the MultiCoNER II task (Fetahu et al., 2023b) this year poses two major new challenges: (1) a fine-grained entity taxonomy with 6 coarse-grained categories (Location, Creative Work, Group, Person, Product and Medical) and 33 fine-grained categories, and (2) simulated errors, such as spelling mistakes, added to the test set to make the task more realistic and difficult.

Figure 1: An example of a wrong prediction by RaNER (Wang et al., 2022b), one of the top systems in the MultiCoNER I task (Malmasi et al., 2022b). This case illustrates that the knowledge covered is not sufficient for fine-grained complex NER.
The previous top systems of the MultiCoNER I task (Wang et al., 2022b) incorporate additional knowledge into pre-trained language models, either a knowledge base or a gazetteer. RaNER (Wang et al., 2022b) builds a multilingual knowledge base based on Wikipedia; the original input sentences are then augmented with retrieved contexts from the knowledge base, allowing the model to access more relevant knowledge. GAIN proposes a gazetteer-adapted integration network, with a gazetteer built from Wikidata, to improve the performance of language models. Although these systems achieve impressive results, they still have some drawbacks. First, insufficient knowledge is a common problem. As shown in Figure 1, the knowledge used in RaNER can help the model identify Erving Goffman as a person, but cannot further determine the fine-grained category Artist. Second, these methods mostly suffer from limited context length. Wang et al. (2022b) discard stitched text longer than 512 tokens after tokenization, which means that plenty of the retrieved context is not visible to the model, wasting the resource. Third, these systems each rely on a single retrieval strategy: Wang et al. (2022b) acquire knowledge by text retrieval, while GAIN accesses knowledge by dictionary matching. Such a single mode of knowledge acquisition results in the underutilization of knowledge.
To tackle these problems, we propose a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER. We use both the Wikipedia and Wikidata knowledge bases to build our retrieval module so that more diverse knowledge can be considered. As shown in Figure 1, if we locate the entry for Erving Goffman in Wikidata, we can make use of its fine-grained entity category information to facilitate predictions. We also discover that the retrieval context dropped by the model may contain useful knowledge. Thus, we explore an infusion approach to make more context visible to the model. In addition, we use multiple retrieval strategies to obtain the most relevant knowledge from the two knowledge bases, further improving model performance.
Our main contributions are as follows:
1. We propose a unified retrieval-augmented system for fine-grained multilingual NER. Our system incorporates more diverse knowledge bases and significantly improves performance compared to baseline systems. (Section § 4, § 5)
2. We initiated our investigation by identifying the primary bottleneck of the previous top-performing system, which we determined to be insufficient knowledge. Consequently, we focused on exploring both data and model enhancements to improve system performance. (Section § 3)
3. We employ multiple retrieval strategies to obtain entity information from Wikidata, in order to complement the missing entity knowledge. (Section § 4.1)
4. We utilize the infusion approach to provide a more extensive contextual view to the model, thus enabling better utilization of the retrieved context. (Section § 4.2)
5. Extensive experimental analysis demonstrates the effectiveness of diverse knowledge sources and broader contextual scopes for improving model performance. (Section § 5)

Related Work
Named Entity Recognition (NER) (Sundheim, 1995) is a fundamental task in Natural Language Processing. Thanks to long-term attention and the rapid development of pre-trained language models, various models (Akbik et al., 2018; Devlin et al., 2019; Yamada et al., 2020; Wang et al., 2020, 2021a) have achieved state-of-the-art results in general NER scenarios and datasets, such as CoNLL 2002 (Sang, 2002), CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), and OntoNotes 5.0 (Pradhan et al., 2013). Considering that the previous task settings and datasets are monolingual and scenario-constrained, the task of Multilingual Complex Named Entity Recognition (MultiCoNER) was proposed to push NER research toward more realistic scenarios (Malmasi et al., 2022b). Our work focuses on this task, and we introduce the related work on the MultiCoNER dataset and methods respectively.

Challenges of MultiCoNER Dataset

(1) Low Context: Texts in MultiCoNER are low in context (Meng et al., 2021), which assesses the model's performance in a more realistic and difficult setting.
(2) Sufficient Diversity: MultiCoNER contains a rich variety of entity types, both simple and difficult, which makes it possible to evaluate the model more comprehensively.
(3) Reasonable Distribution: Considering that the non-negligible long-tail distribution problem faced by previous datasets makes the construction of training data extremely difficult, MultiCoNER ensures that the distribution of its entities is more even and reasonable so that models can be evaluated comprehensively.
(4) High Complexity: Increasing the complexity of the dataset can effectively improve its quality (Fetahu et al., 2021). Therefore, in addition to monolingual subsets, MultiCoNER also distinctively contains a multilingual subset and a code-mixed one, which makes it more challenging. Note that in the dataset version of SemEval-2023, this challenge and setting do not exist.

Progress of MultiCoNER Methods
With the MultiCoNER dataset as the core, SemEval-2022 Task 11 attracted 236 participants, and 55 teams successfully submitted their systems (Malmasi et al., 2022b). Among them, there are many successful works worthy of discussion. DAMO-NLP (Wang et al., 2022b) proposes a knowledge-based method that draws multilingual knowledge from Wikipedia to provide informative context for the NER model, achieving the previous best overall performance on the MultiCoNER dataset. USTC-NELSLIP proposes a gazetteer-adapted integration network to improve model performance in recognizing complex entities. QTrade AI (Gan et al., 2022) designs several data augmentation strategies for the low-resource code-mixed NER track. Previous efforts on the MultiCoNER dataset have shown that external data and beneficial knowledge are essential to improving the performance of NER models.
Retrieval-augmented NLP Methods Retrieval-augmented techniques have proven highly effective in various natural language processing (NLP) tasks, as evidenced by the exceptional performance achieved in prior studies (Lewis et al., 2020; Khandelwal et al., 2019; Borgeaud et al., 2022). These approaches usually contain two parts: an information retrieval module and a task-specific module. In the context of named entity recognition (NER), Wang et al. (2021b) propose leveraging off-the-shelf search engines like Google to retrieve external information and enhance the contextual representations of tokens in the input text, resulting in improved performance. Subsequent research has focused on developing task-specific retrieval systems for domain-specific NER and multi-modal NER tasks, respectively (Zhang et al., 2022c; Wang et al., 2022a). Drawing upon these insights, our proposed system is designed and optimized with guidance from these previous works.
Large Language Models on IE Recent advances in NLP scale the parameter counts of language models to hundreds of billions and have achieved phenomenal performance, such as GPT-3 (Brown et al., 2020), OPT-175B (Zhang et al., 2022a), Flan-PaLM (Chung et al., 2022), LLaMA (Touvron et al., 2023) and ChatGPT. In the field of information extraction (IE), ChatIE (Wei et al., 2023) first uses ChatGPT for extraction, and the results show that there is still much room for improvement. Recent work experiments with instruction fine-tuning (Wang et al., 2023b) and simplified training objectives (Wang et al., 2023a) to adapt large language models to extraction tasks.

Data
The MultiCoNER II corpus (Fetahu et al., 2023a) aims at recognizing complex named entities and poses new challenges for current NER systems. To meet these challenges, we first reproduce the results of the top system (Wang et al., 2022b) and perform error analysis on the validation sets. We observe that the performance bottleneck of the system lies in the lack of knowledge. We then investigate how to break this bottleneck from the data and model perspectives and improve model robustness.
Following Wang et al. (2022b), we build a multilingual KB based on the Wikipedia of the 12 languages to search for related documents. We download the latest (2022.10.21) version of the Wikipedia dump from Wikimedia and convert it to plain text. We execute the official system on the MultiCoNER II corpus and categorize the results according to whether the annotated entity appears in the retrieval context or not; the results are shown in Table 1. While prior work uses Wikidata to build a gazetteer, we explore enhancing our retrieval system with Wikidata directly. Wikidata is a free and entity-centric knowledge base. Every entity in Wikidata has a page consisting of a label, several aliases, descriptions, and one or more entity types. As shown in Figure 2, Base indicates that only the Wikipedia knowledge base is used, and More Database indicates that we use both the Wikipedia and Wikidata knowledge bases. The entity coverage improves on all 4 languages, with a maximum gain of 12.6% on ZH. In addition, as More Context shows, expanding the length of the retrieval context also brings in more entity knowledge. Thus, we use the infusion approach to make more of the retrieval context visible to the model. More details are described in Section § 4.2.

Methodology
Overview As depicted in Fig. 3, U-RaNER is comprised of two parts: a retrieval augmentation module and a NER module. The retrieval augmentation module utilizes multiple retrieval strategies and the NER module adopts a modified transformer structure to utilize the retrieved knowledge. Given an input sentence, U-RaNER retrieves similar texts and entities as external knowledge, which are then utilized in the form of text and vectors to help the NER module obtain improved predictions.

Retrieval Augmentation Module
In the retrieval augmentation module, we design three different retrieval strategies, namely TEXT2TEXT, TEXT2ENT, and ENT2ENT, which aim to obtain a variety of useful information from different sources to enhance our NER model.

TEXT2TEXT
The TEXT2TEXT retrieval strategy obtains texts related to the input sentences from Wikipedia by way of sparse retrieval (McDonell, 1977; Robertson and Zaragoza, 2009). Through this form of retrieval, the goal is to obtain as much additional and useful relevant information as possible to alleviate the low-context problem of MultiCoNER. Specifically, we first parse the latest Wikipedia dumps and index them with ElasticSearch. We then use each sentence in the dataset as a query and use the BM25 retrieval algorithm that comes with ElasticSearch to search the built index and obtain the Top-K documents related to the input sentence from Wikipedia, as shown in the first example of Table 2. Note that the TEXT2TEXT strategy was used by Wang et al. (2022b) to win 10 out of 13 tracks in SemEval-2022 Task 11.
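The indexing-and-query flow above can be sketched as follows. This is our own minimal illustration, not the paper's code: the index name "wiki", the field name "text", and K=10 are assumptions.

```python
# Sketch of the TEXT2TEXT lookup. The index name "wiki", field "text", and
# K=10 are illustrative assumptions, not values from the paper.

def build_bm25_query(sentence: str, k: int = 10) -> dict:
    """Build an ElasticSearch request body; a standard "match" query is
    scored with BM25, ElasticSearch's default similarity."""
    return {
        "size": k,                               # return the Top-K documents
        "query": {"match": {"text": sentence}},  # BM25-ranked full-text match
    }

# Against a running ElasticSearch instance, the retrieval would roughly be:
#
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="wiki", body=build_bm25_query(sentence))["hits"]["hits"]
#   contexts = [h["_source"]["text"] for h in hits]
```

The retrieved `contexts` would then be concatenated with the input sentence, as described in Section § 4.2.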

TEXT2ENT
The TEXT2ENT retrieval strategy aims to retrieve candidate entities that may be mentioned in the input sentences, as illustrated in the second example of Table 2. We believe that if the candidate entities possibly mentioned in a sentence can be retrieved in advance, the related knowledge can help build a stronger entity recognition model. The TEXT2ENT strategy is inspired by related techniques in dictionary disambiguation (Harige and Buitelaar, 2016) and entity linking (Cao et al., 2021). However, dictionary disambiguation can only perform hard matching, and entity linking lacks the detailed annotation information we need (i.e., the correspondence between spans and entities), so these two traditional methods cannot be directly applied to our scenario. Therefore, in practice we tried two different retrieval methods, namely sparse retrieval and dense retrieval. The details of these two retrieval methods are in Appendix A.4.

ENT2ENT
The ENT2ENT retrieval strategy aims to retrieve entities and their corresponding information from Wikidata. Wikidata integrates billions of structured facts between millions of entities, such as entity aliases and the relationships between entity pairs. Intuitively, such information is beneficial to our NER model. In the process of ENT2ENT retrieval, we want to find external entity types that may inform the entity labeling of the input sentence. Concretely, for each given entity, we first query Wikidata to get its relevant Wikidata entities. Next, we gather and utilize the properties of these Wikidata entities from their corresponding Wikidata pages. In particular, we take the "instance of" and "subclass of" properties as the entity types. For example, as shown in Table 2, with the entity "deal hudson" as the query, the ENT2ENT strategy retrieves its type (i.e., "human") and description text. Finally, all relevant Wikidata entities and their types serve as the retrieval-augmented data. The detailed procedure for ENT2ENT is in Appendix A.5.

Named Entity Recognition Module
BERT-CRF We use xlm-roberta-large (XLM-R) (Conneau et al., 2020) as the PLM for all the tracks. Given an input sentence x = x_1, x_2, ..., x_n, transformer-based standard fine-tuning for NER first feeds the input sentence x into the PLM to get the token representations h. The token representations h are fed into a CRF layer to get the conditional probability p_θ(y | h), and the model is trained by maximizing the conditional probability, i.e., minimizing the negative log-likelihood loss L = −log p_θ(y | h).
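At inference time, the CRF layer decodes the highest-scoring tag sequence with the Viterbi algorithm. A minimal pure-Python sketch of that decoding step, with toy emission and transition scores (a real system would use a batched GPU implementation inside the CRF layer):

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:   list of per-token score lists (length num_tags), from the PLM.
    transitions: [num_tags][num_tags] list; transitions[i][j] is the score
                 of moving from tag i to tag j.
    """
    num_tags = len(emissions[0])
    # score[j] = best score of any path ending in tag j at the current token
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, bp = [], []
        for j in range(num_tags):
            best_prev = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
            bp.append(best_prev)
        score = new_score
        backpointers.append(bp)
    # Follow the backpointers from the best final tag to recover the path.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for bp in reversed(backpointers):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

Training maximizes p_θ(y | h) by normalizing such path scores over all possible tag sequences; the sketch only shows the argmax decoding.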
RaNER Given the retrieval context x̂, we define a neural network parameterized by θ that learns from the concatenated input [x; x̂]. We feed the input and obtain the representations [h; ĥ] = XLM-R([x; x̂]). We then feed h into the CRF layer and train by maximizing the conditional probability p_θ(y | h) as mentioned above.
U-RaNER To exploit more retrieval contexts, we first slice x̂ by the model-limited input length as x̂ = x̂_0, x̂_1, ..., x̂_m. Then, we keep x̂_0 as the text for concatenation and feed the rest of the context list into the PLM as [(x; x̂_1), ..., (x; x̂_m)], a form used in Lewis et al. (2020) for better information interaction, and get the token vector list [ĥ_1, ..., ĥ_m]. For Pre-Infusion, we fetch the token vectors at the positions of the anchors from the vector list [ĥ_1, ..., ĥ_m]. Then, we perform the mean operation to obtain the set of anchor vectors V ∈ R^{p×d}, where p is the number of anchors and d is the hidden size. Considering that the word embedding layer in XLM-R has two input modes, vocabulary-index input and word-embedding input, we first use the former for [x; x̂_0] to obtain the input text embedding E, and then concatenate E and the anchor vectors V to form the word-embedding input. Finally, we get the representations [h; ĥ_0; h_v]. Only h is passed to the CRF layer.
For Post-Infusion, we first feed [x; x̂_0] into XLM-R and get the token representations [h; ĥ_0]. For the representation list [h; ĥ_1, ..., ĥ_m], we perform the max operation on the token dimension to obtain the final representation h_max. Then, we use h_max for the CRF calculation as in BERT-CRF. Notably, we find that post-infusion is superior to pre-infusion in our preliminary experiments, so the default infusion method in the experimental section is post-infusion.
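The max operation in post-infusion can be sketched as an element-wise maximum over the list of token-representation matrices. This is our own toy illustration with plain Python lists standing in for tensors:

```python
def post_infusion(h_list):
    """Element-wise max over a list of token-representation matrices.

    h_list: list of m matrices, each [seq_len][hidden], holding the input
    tokens' representations under different retrieval-context slices.
    Returns h_max with h_max[t][d] = max over the list at position (t, d).
    """
    seq_len, hidden = len(h_list[0]), len(h_list[0][0])
    return [
        [max(h[t][d] for h in h_list) for d in range(hidden)]
        for t in range(seq_len)
    ]
```

In a tensor framework this is a single reduction (e.g. a stacked max over the list axis); the resulting h_max is what the CRF layer consumes.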

Ensemble Module
Given predictions {ŷ_θ1, ..., ŷ_θm} from m models with different random seeds, we use majority voting to generate the final prediction ŷ. Following Yamada et al. (2020) and Wang et al. (2022b), the module ranks all spans in the predictions by the number of votes in descending order and selects the spans with more than 50% of the votes into the final prediction. If selected spans overlap, the spans with more votes are kept, and the longer spans are kept when votes are tied.
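The voting procedure above can be sketched as follows. Representing a predicted span as a (start, end, label) triple is our own assumption for illustration:

```python
from collections import Counter

def vote_ensemble(predictions):
    """Majority-vote span ensembling, as described above.

    predictions: list of m span sets, each a set of (start, end, label).
    Keeps spans with more than 50% of the votes; on overlap, the span with
    more votes wins, and with equal votes the longer span wins.
    """
    m = len(predictions)
    counts = Counter(s for pred in predictions for s in pred)
    # Candidates above the 50% threshold, ranked by votes then by length.
    ranked = sorted(
        (s for s, c in counts.items() if c > m / 2),
        key=lambda s: (-counts[s], -(s[1] - s[0])),
    )
    final = []
    for start, end, label in ranked:
        if all(end <= fs or start >= fe for fs, fe, _ in final):  # no overlap
            final.append((start, end, label))
    return sorted(final)
```

Because candidates are processed best-first, any span that overlaps an already-accepted span is simply dropped.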

Datasets and Evaluation Metrics
We use the official MultiCoNER II dataset (Fetahu et al., 2023a) in all tracks to train our models. The detailed data statistics are in Appendices A.1 and A.3. The results on the leaderboard are evaluated with entity-level macro F1 scores, which treat all labels equally.

Training Strategy
NER Model Training Our final NER models are trained on a combined dataset including both the training and development sets of each track to fully utilize the labeled data. For models trained on the combined dataset, we use the final model checkpoint after training. The detailed system configuration is in Appendix A.2.

Multi-stage Fine-tuning Multi-stage fine-tuning (MSF) aims at transferring the parameters of fine-tuned embeddings in an early-stage model into other models in the next stage (Shi and Lee, 2021). The approach stores the checkpoint of the fine-tuned XLM-R embeddings at the early stage and uses it to initialize the XLM-R embeddings for model training at the next stage. Wang et al. (2022b) experimentally demonstrate that MSF can leverage the annotations from all tracks and thus improve performance and accelerate training. In addition, we observe that inconsistent training set sizes across language tracks can also degrade model performance. We use an increasing batch size and an upsampling strategy to address this issue. The details are shown in Appendix B.1.

Baselines
In this paper, we compare the proposed U-RaNER with the following baseline models: • BERT-CRF, as introduced in Section 4.2, is composed of a BERT-like encoder and a CRF decoder. It is widely used for sequence labeling tasks. We use xlm-roberta-large (XLM-R) (Conneau et al., 2020) as the pretrained backbone for all the tracks.
• RaNER, as introduced in Section 4.2, improves BERT-CRF by incorporating retrieval contexts as input for better performance. Retrieval-augmented methods have proven highly effective in the NER task (Wang et al., 2021b; Zhang et al., 2022c; Wang et al., 2022a).
• RaNER-MSF (Wang et al., 2022b) achieves the previous best overall performance on the Multi-CoNER I dataset, which exploits multistage fine-tuning to leverage the annotations from all tracks and thus improve performance and accelerate training of RaNER.
• ChatGPT, also known as gpt-3.5-turbo, is the most capable GPT-3.5 (Ouyang et al., 2022) model, optimized for chat. Following Lai et al. (2023), our prompt structure for ChatGPT consists of a task description, a note on the output format, and an input sentence. Besides the Single-turn prompting strategy, we additionally try two enhanced prompting strategies: Multi-turn and Multi-ICL. Multi-turn first performs the task over the 6 coarse-grained categories, and then performs finer-grained NER. Multi-ICL constructs demonstrations spliced after the note part by randomly selecting examples from the training set. The detailed prompting procedures for Single-turn, Multi-turn and Multi-ICL are in Appendix A.6.
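The Single-turn prompt structure (task description, output-format note, input sentence) can be illustrated as below. The wording is our own paraphrase, not the exact prompt used in the experiments:

```python
def build_single_turn_prompt(sentence: str, labels: list) -> str:
    """Assemble a Single-turn NER prompt in three parts: a task description,
    a note on the output format, and the input sentence. The phrasing here
    is illustrative only, not the paper's exact prompt."""
    task = (
        "Please identify all named entities in the sentence below and "
        "assign each one a type from this list: " + ", ".join(labels) + "."
    )
    note = 'Answer with one "entity: type" pair per line.'
    return f"{task}\n{note}\nSentence: {sentence}"
```

For Multi-ICL, randomly sampled training examples would be spliced in between the note and the input sentence.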
Results and Analysis

Main Results
In total, 45 teams participated in the MultiCoNER II shared task. Due to limited space, we only compare our system with the systems from teams NLPeople, USTC-NELSLIP, IXA/Cogcomp, CAIR-NLP, PAI and NetEase.AI. NetEase.AI took part only in the Chinese track, so we only have access to their results for that track. In the post-evaluation phase, we evaluate our baseline system without the additional knowledge bases to further show the effectiveness of our retrieval-augmented system. The official results and the results of our baseline system are shown in Table 3. Our system performs best on 9 out of 13 tracks, with the average result exceeding the second-place system by an absolute F1 margin of 7.0%. Moreover, our system outperforms our baseline by 19.18% F1 on average, which demonstrates that a retrieval-augmented system based on multiple knowledge bases is extremely helpful in identifying complex entities, leading to a significant improvement in model performance.
In addition, we use three prompting strategies to evaluate ChatGPT. Due to the overwhelming size of the test sets (on the order of millions of samples), the expense of invoking the OpenAI interface is unaffordable, so we experiment on the validation set; the results are in Table 4. We observe that ChatGPT's performance on the multilingual NER dataset is quite poor, with an average F1 score of only 14.78% under the best strategy. Even at the coarse-grained level the result is merely 29.70% (Table 5), which is comparable to the result measured on MultiCoNER I (Malmasi et al., 2022b) by Lai et al. (2023).

Ablation Study
In this section, we perform extensive ablation experiments to show the effectiveness of various settings in our retrieval-augmented system. Following Wang et al. (2022b), we employ the multi-stage fine-tuning (MSF) training strategy. As shown in Table 4, model performance improves from 87.95% to 89.92%, which illustrates the effectiveness of multi-stage training. Note that the following five rows in Table 4 all use the MSF training strategy.
For the different knowledge sources, the use of Wikipedia data achieves a gain of 12.61% (RaNER-MSF vs. BERT-CRF), the use of Wikidata achieves a gain of 13.16% (ENT2ENT vs. BERT-CRF), and using both together achieves the maximum gain of 15.46% over BERT-CRF. This shows that knowledge is highly useful for system performance and illustrates the complementarity of the two knowledge bases.
For the different knowledge acquisition methods, the ENT2ENT approach is superior to the TEXT2ENT approach (90.47% vs. 86.22%). In addition, we use the infusion approach to further improve model performance (RaNER-MSF vs. TEXT2TEXT), which suggests that ensuring knowledge is visible to the model is also important. The default infusion method in our experiments is post-infusion. We also analyze the impact of the two different infusion methods on performance in Appendix B.2.

Coarse-and-fine Category Analysis
To illustrate the advantages of U-RaNER on fine-grained NER, we map the model predictions to the coarse-grained level according to the official topology of fine-grained categories, using the RaNER-MSF and U-RaNER w/ ENT2ENT models from Table 4. As shown in Table 5, the improvements in the coarse-grained metrics are significantly lower than those in the fine-grained metrics, differing by 1.38% (3.91% on the ZH track). This suggests that the proposed U-RaNER is better at coping with complex fine-grained classification scenarios. Besides, the average F1 for ChatGPT at the two granularities is markedly different (29.70% vs. 14.43%), which shows the difficulty of identifying fine-grained complex entities.

Query Relevance
We define a relevance metric to compute the relevance between the query and the retrieval result. The metric calculates the Intersection-over-Union (IoU) between the characters of the query and those of the retrieved result, where repeated characters are counted as distinct. We plot the results on the training sets of 6 tracks in Figure 4. It can be observed that the IoU values of the TEXT2TEXT strategy form a larger cluster than those of TEXT2ENT and ENT2ENT, which indicates that TEXT2TEXT retrieval focuses more on the context instead of merely the entities in the query text. Additionally, we observe that the distributions of ENT2ENT have larger medians than those of TEXT2ENT. This might be because ENT2ENT retrieves more relevant entities from Wikidata than TEXT2ENT. By employing diverse retrieval techniques, we can leverage data with distinct attributes to improve the effectiveness of the model.
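Counting repeated characters as distinct makes this a multiset IoU, which can be computed directly with counters. A minimal sketch of the metric as we understand it:

```python
from collections import Counter

def char_iou(query: str, retrieved: str) -> float:
    """Character-level Intersection-over-Union between the query and the
    retrieval result; repeated characters count as distinct occurrences
    (multiset intersection and union via Counter)."""
    q, r = Counter(query), Counter(retrieved)
    inter = sum((q & r).values())  # per-character min of the two counts
    union = sum((q | r).values())  # per-character max of the two counts
    return inter / union if union else 0.0
```

For example, "aab" vs. "ab" shares one "a" and one "b" out of three total character slots, giving an IoU of 2/3.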

Context Length Analysis
In this section, we analyze the impact of different context lengths on model performance. We conduct a series of experiments on the EN, ES, PT and MULTI datasets with the context length ranging from 128 to 2048. We observe from Figure 5 that model performance increases as the context length grows. However, when the context list length exceeds 1024, the trend of performance improvement on all four datasets slows down. This indicates that the knowledge capacity of the contexts saturates as the context length increases. For better performance, we need to find complementary and highly relevant contextual pieces as additional knowledge sources.

Error Analysis
We divide the NER task into two stages: mention detection, which locates entity spans, and entity typing, which classifies the spans with pre-defined labels. To further analyze the limitations of our proposed model, we present the experimental results on the 12 languages in Table 6. The results reveal that the average F1 score for mention detection is 97.21, whereas the accuracy of entity typing is 90.35. This provides evidence that the bottleneck in fine-grained NER is typing. More detailed discussion, including the different retrieval methods and a case study, is in Appendices B.3 and B.4.

Conclusion
In this paper, we propose a unified retrieval-augmented system (U-RaNER) for the MultiCoNER II shared task, which wins 9 out of 13 tracks. We show that the bottleneck of the previous top system is the lack of knowledge. Accordingly, we use both the Wikipedia and Wikidata knowledge bases with three retrieval approaches so that more diverse knowledge can be considered. We also explore the infusion approach to make more context visible to the model so as to make the best use of the resources. Our error analysis indicates that the entity typing sub-task is the bottleneck of the current system. In the future, we plan to exploit the knowledge in large language models such as ChatGPT or LLaMA through self-verification or by fine-tuning adapters, in order to achieve robust generalization performance.

A.1 MultiCoNER II Corpus
The multilingual NER II corpus (MultiCoNER II) aims at recognizing complex named entities, like the titles of creative works, which are not simple nouns and pose challenges for current NER systems. Sharing the same set of tags, the 12 language-specific datasets are: BN-Bangla, DE-German, EN-English, ES-Spanish, FA-Farsi, FR-French, HI-Hindi, IT-Italian, PT-Portuguese, SV-Swedish, UK-Ukrainian and ZH-Chinese. Table 7 shows the detailed dataset statistics.

A.2 System Setup
For a fair comparison with prior systems, we use xlm-roberta-large (Conneau et al., 2020) as our initial checkpoint. We use the AdamW (Loshchilov and Hutter, 2017) optimizer with a linear warmup-decay learning rate schedule and a dropout (Srivastava et al., 2014) of 0.1. We set the batch size and learning rate to 16 and 2e-5, and train models over 4 random seeds. According to the dataset sizes, we train the models for 5 epochs and 20 epochs for multilingual and monolingual models, respectively. All our experiments are conducted on a single NVIDIA A100 80GB GPU. For the ensemble module, we train about 4 models for each track.

A.3 Fine-grained Taxonomy
The tagset of MultiCoNER II is fine-grained, with 6 coarse-grained categories and 33 fine-grained categories. The coarse-to-fine mapping of the tags includes, for example: • Location (LOC): Facility, OtherLOC, HumanSettlement, Station. Figure 6 shows the full fine-grained taxonomy.

A.4 Detailed Procedure for TEXT2ENT
For sparse retrieval, we find the relevant entities from Wikidata, which contains millions of entities. As in the TEXT2TEXT strategy, we utilize the description and alias information in Wikidata and index it with ElasticSearch. We use each sentence in the dataset as the query and retrieve candidate entities with the BM25 algorithm. In order to find as many candidate entities as possible, we apply an iterative retrieval procedure in which we construct a new query by masking, in the query text, the entities retrieved in the previous round.
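The iterative mask-and-retrieve loop can be sketched as below. The retrieval backend is stubbed out with a toy gazetteer matcher standing in for the BM25/ElasticSearch lookup; the function names, the "[MASK]" placeholder, and the round limit are our own assumptions:

```python
def iterative_entity_retrieval(query: str, retrieve, max_rounds: int = 3):
    """Repeatedly retrieve one candidate entity and mask its mention in the
    query, so that later rounds can surface further entities.

    `retrieve(text)` stands in for the BM25/ElasticSearch lookup; it should
    return a matching entity name, or None when nothing is found.
    """
    found = []
    text = query.lower()
    for _ in range(max_rounds):
        entity = retrieve(text)
        if entity is None:
            break
        found.append(entity)
        text = text.replace(entity, "[MASK]")  # mask the retrieved mention
    return found

# Toy stand-in for the sparse retriever: first gazetteer entry in the text.
gazetteer = ["erving goffman", "canada"]
def toy_retrieve(text):
    return next((e for e in gazetteer if e in text), None)
```

Each round removes an already-found mention from the query, so the next BM25 query is dominated by the remaining, not-yet-covered spans.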
For dense retrieval, we utilize the title and paragraph information from Wikipedia to construct the knowledge base for dense entity retrieval (considering the memory limit of dense retrieval model training, we truncate the paragraph information and keep only the first 128 tokens), then use the input sentence as the query to retrieve its related Top-K entities in the knowledge base. The dense retrieval model we use is the widely used Bi-Encoder architecture (Zhang et al., 2022b). Unlike sparse retrieval, the dense retrieval model is trainable, so it can better capture the semantic characteristics of the MultiCoNER dataset. Therefore, in practice, we first preprocess the train/dev sets of MultiCoNER into the data format for dense retrieval model training. Specifically, because the train/dev sets provide gold entity annotations for each sentence, we can fuzzily match each span in the sentence against the entity titles in our knowledge base to link it to a specific entity id. We then use the reconstructed training data to train a dense entity retrieval model with reliable performance, which is finally applied to the test set to obtain candidate entities for its sentences.
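At inference time, a Bi-Encoder scores a query against precomputed entity embeddings by inner product. A minimal sketch with toy vectors standing in for the trained encoders' outputs (the function names are ours):

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_retrieve_topk(query_vec, entity_vecs, k=2):
    """Bi-Encoder retrieval step: the query and every entity are encoded
    independently, so entity embeddings can be precomputed once; candidates
    are then ranked by inner-product similarity with the query embedding.

    entity_vecs: dict mapping entity name -> precomputed embedding vector.
    """
    ranked = sorted(
        entity_vecs,
        key=lambda e: dot(query_vec, entity_vecs[e]),
        reverse=True,
    )
    return ranked[:k]
```

In practice the ranking would be done with an approximate-nearest-neighbor index over the full entity table rather than an exhaustive sort.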

A.5 Detailed Procedure for ENT2ENT
Suppose that we have already retrieved the boundaries of possible related entities of a sentence; we then want to encode more knowledge about these entities to benefit the prediction of target entities and their types. A good choice is leveraging Wikidata, which integrates billions of structured facts about millions of entities, such as entity aliases and the relationships between entity pairs. Therefore, we adopt the following steps to acquire ENT2ENT knowledge to augment the data and thereby enhance the entity recognition ability of our model. 1. We preprocess Wikidata to construct two dictionaries for each language in this task. One takes each entity name and each alias string of every entity in Wikidata as keys and the index (called "Qid") of each entity as values.
The other takes Qid of each entity as keys and two attributes (called "subclass of" and "sub-instance of") content of each entity as values. It is worth mentioning that the values of the two attributes associated with each entity in Wikidata are themselves entities. Therefore, this method is referred to as ENT2ENT retrieval. For the following description, we call the first dictionary String-to-Qid and the second dictionary Qid-to-Types.
2. For each language, we retrieve augmentation data according to the pre-retrieved entities and the knowledge dictionaries from Step 1. Concretely, for each retrieved entity, we first extract the corresponding Qid if the entity matches a key in the String-to-Qid dictionary. If this lookup succeeds, we then use the Qid to query the Qid-to-Types dictionary to get the values of "subclass of" and "sub-instance of" as types of the retrieved entity. The values of some Qids in the Qid-to-Types dictionary of a specific language may be NULL; in this situation, we fall back to the English Qid-to-Types dictionary (except when processing English itself).
3. If we obtain the language-specific types or English types of some pre-retrieved entities from Step 2, we sequentially splice these pre-retrieved entities and their retrieved types after the original sentence. For those pre-retrieved entities without retrieved types, we splice only the entities themselves.
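The three steps above amount to a lookup-and-splice routine, sketched below. The splice format (parentheses and semicolons) and the function name are our illustrative assumptions; the paper does not specify the exact delimiters.

```python
def augment_with_types(sentence, retrieved_entities,
                       string_to_qid, qid_to_types, en_qid_to_types=None):
    """Splice each pre-retrieved entity (and its Wikidata types, if any)
    after the original sentence, mirroring Steps 1-3 of ENT2ENT retrieval."""
    pieces = [sentence]
    for ent in retrieved_entities:
        qid = string_to_qid.get(ent)              # String-to-Qid lookup
        types = qid_to_types.get(qid) if qid else None
        # fall back to the English dictionary when the language-specific
        # types are missing (NULL)
        if not types and qid and en_qid_to_types:
            types = en_qid_to_types.get(qid)
        if types:
            pieces.append(f"{ent} ( {' , '.join(types)} )")
        else:
            pieces.append(ent)                    # entity without types
    return " ; ".join(pieces)
```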
A.6 Detailed Procedure for Prompting
Following Lai et al. (2023), our multi-turn prompt structure for ChatGPT consists of a task description, a note on the output format, and an input sentence. Since the experiments in Lai et al. (2023) indicate that English prompts work better than multilingual ones, we use English prompts for all languages. As shown in Figure 7, the task description part explains the task and lists the entity categories, the note part specifies the annotation scheme and output format, and finally we append the input text. In our experiment, {...} is filled with the content in Appendix A.3.
Multi-turn first performs the task with the 6 coarse-grained categories, and then performs finer-grained NER. In our experiment, {...} is filled with the response of ChatGPT and the content in Appendix A.3.
Multi-ICL constructs demonstrations, spliced after the note part, by randomly selecting examples from the training set; each xxx is replaced with a selected example. The corresponding prompts can be found in Figure 8.
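The prompt assembly described in this subsection can be sketched as plain string construction. The helper name and the exact wording below are illustrative abbreviations of the full prompts in Figures 7 and 8:

```python
def build_multiturn_prompts(tokens, coarse_labels, fine_taxonomy):
    """Assemble the two turns of the multi-turn ChatGPT prompt:
    turn 1 tags the sentence with coarse-grained labels, turn 2 asks for
    the fine-grained refinement based on turn 1's response."""
    task = ("You are working as a named entity recognition expert and your "
            "task is to label a given text with named entity labels.")
    note = "Note: Please use BIO annotation schema to complete this task."
    turn1 = (f"{task} The named entity labels that you will be using are "
             f"{', '.join(coarse_labels)}. {note} Input: {tokens} Output:")
    turn2 = ("Please complete the above task at a finer granularity based "
             f"on the fine-grained taxonomy below {fine_taxonomy}. Output:")
    return turn1, turn2
```

For Multi-ICL, randomly sampled (input, output) pairs from the training set would additionally be spliced in after the note.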
Prompt in Figure 7 (fine-grained): Task Description: You are working as a named entity recognition expert and your task is to label a given text with named entity labels. Your task is to identify and label any named entities present in the text. The named entity labels that you will be using are 33 categories, as shown below {...}. Note: Please use BIO annotation schema to complete this task. Please make sure to label each word of the entity with the appropriate prefix ("B" for the first word of the entity, "I" for any non-initial word of the entity). For words which are not part of any named entity, you should return "O". Your output format should be a list of tuples, where each tuple consists of a word from the input text and its corresponding named entity label. Input: ["from", "1995", "to", "2011", "deal", "hudson", "was", "the", "magazine's", "publisher", "."] Output:

Prompt in Figure 8 (coarse-grained first turn, then fine-grained): Task Description: You are working as a named entity recognition expert and your task is to label a given text with named entity labels. Your task is to identify and label any named entities present in the text. The named entity labels that you will be using are PER (person), LOC (location), CW (creative work), GRP (group of people), PROD (product), and MED (medical). Note: Please use BIO annotation schema to complete this task. Please make sure to label each word of the entity with the appropriate prefix ("B" for the first word of the entity, "I" for any non-initial word of the entity). For words which are not part of any named entity, you should return "O". Demonstrations: Optional. [Input: xxx, Output: xxx]. Input: ["from", "1995", "to", "2011", "deal", "hudson", "was", "the", "magazine's", "publisher", "."] Output: {...}. Input: Please complete the above task at a finer granularity based on the fine-grained taxonomy below {...}. Output:

We observe that inconsistent training set sizes on different language tracks lead to a degradation of model performance from 86.47% to 85.07%.
We use a larger batch size and a scaling-up strategy to address this issue. As shown in Table 8, increasing the batch size from 4 to 128 improves model performance from 85.07% to 86.82%. Furthermore, scaling up the training data size on BN, DE, HI and ZH results in a further gain of +1.09%.

B.2 Two Infusion Approaches
In § 4.2, we propose two infusion methods (Pre-Infusion and Post-Infusion) to make more context visible to the model. Here, we quantitatively compare their effects on model performance. As shown in Table 9, the post-infusion method is superior to the pre-infusion method on all language tracks. We attribute this to the fact that the pre-infusion method only considers the anchor information and ignores other contextual information, while the post-infusion method uses more contextual knowledge and thus achieves better performance.
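A minimal sketch of what post-infusion might look like at the input level, assuming it amounts to appending retrieved contexts after the original sentence under a fixed length budget; the function name and separator token are illustrative, not the system's actual implementation:

```python
def post_infusion_input(sentence_tokens, retrieved_contexts,
                        sep="[SEP]", max_len=512):
    """Append retrieved contexts after the input sentence.

    The sentence tokens always remain fully visible; contexts are added
    one by one and truncated once the encoder's length budget is reached.
    """
    out = list(sentence_tokens)
    for ctx in retrieved_contexts:
        candidate = out + [sep] + ctx
        if len(candidate) > max_len:
            room = max_len - len(out) - 1   # tokens left after a separator
            if room > 0:
                out += [sep] + ctx[:room]   # keep a truncated context
            break
        out = candidate
    return out
```

Only the representations of the original sentence tokens would then be fed to the tagging layer.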

B.3 Different Retrieval Methods
To analyze the effectiveness of the two TEXT2ENT retrieval strategies we design, we compare their retrieval performance (i.e., Recall@50) and the resulting NER performance (i.e., F1) based on their respective retrieval results. From Table 11, we find that sparse retrieval does not perform worse than dense retrieval overall: its recall is higher than that of dense retrieval for both the PT and SV languages. In addition, for the BN and DE languages, although the recall of sparse retrieval is lower than that of dense retrieval, the final NER performance is higher. We attribute this mainly to the different retrieval sources of the two strategies: our sparse strategy retrieves from Wikidata, while the dense strategy retrieves from Wikipedia, whose retrieval quality is easily disturbed by entity aliases. Moreover, because dense retrieval requires training a model, we truncate the paragraph information in Wikipedia for model training and retrieval, so the information available for dense retrieval is also limited. However, the ZH track shows that the dense retrieval strategy is more robust across languages than the sparse one. Therefore, when dealing with retrieval in different languages, we can flexibly choose between strategies based on the quality of the retrieval resources in the corresponding language to obtain better performance.

B.4 Case Study
Table 10 provides a closer examination of the predicted results of BERT-CRF, RaNER, and U-RaNER. We select three cases from the English dev set to analyze in detail.
In the first case, fine-grained NER necessitates comprehensive information to accurately classify long-tail entity spans; by utilizing knowledge from multiple sources, U-RaNER successfully predicts "pudendal nerve entrapment".
In the second case, RaNER's typical ambiguity problem is evident: the context retrieved from the Wikipedia source alone lacks pertinent information about the target entity "gloucestershire", which could refer to either a county or a sports club.
However, in the third case, the retrieval-based systems wrongly predict "theles leites" and "jesse taylor" as "Athlete" due to retrieved knowledge indicating that they are both mixed martial arts fighters. This demonstrates that the use of retrieved information can sometimes be misleading and even harmful.