SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)

We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as in noisy settings. The task used the MultiCoNER V2 dataset, composed of 2.2 million instances in Bangla, Chinese, English, Farsi, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, and Ukrainian. MultiCoNER 2 was one of the most popular tasks of SemEval-2023, attracting 842 submissions from 47 teams; 34 teams submitted system papers. Results showed that complex entity types such as media titles and product names were the most challenging. Methods fusing external knowledge into transformer models achieved the best performance, with the largest gains on the Creative Work and Group classes, which remain challenging even with external knowledge. Some fine-grained classes proved more challenging than others, such as SCIENTIST, ARTWORK, and PRIVATECORP. We also observed that noisy data has a significant impact on model performance, with an average drop of 10% on the noisy subset. The task highlights the need for future research on improving NER robustness on noisy data containing complex entities.


Introduction
Complex Named Entities (NE), like the titles of creative works, are not simple nouns and pose challenges for NER systems (Ashwini and Choi, 2014). They can take the form of any linguistic constituent, like an imperative clause ("Dial M for Murder"), and do not look like traditional NEs (Persons, locations, etc.). This syntactic ambiguity makes it challenging to recognize them based on context.
We organized the Multilingual Complex NER (MultiCoNER) task (Malmasi et al., 2022b) at SemEval-2022 to address these challenges in 11 languages, receiving a positive community response with 34 system papers. Results confirmed the challenges of processing complex and long-tail NEs: even the largest pretrained Transformers did not achieve top performance without external knowledge. The top systems infused transformers with knowledge bases and gazetteers. However, such solutions are brittle against out-of-knowledge-base entities and noisy scenarios (e.g. spelling mistakes and typos). For entities with fine-grained classes, apart from the entity surface form, the context is critical in determining the correct class.
MULTICONER 2 expanded on these challenges by adding fine-grained NER classes and including noisy input. Fine-grained NER requires models to distinguish between sub-types of entities that differ only at the fine-grained level, e.g. SCIENTIST vs. ATHLETE. In these cases, it is crucial for models to capture the entity's context. In terms of noise, we assessed how small perturbations in the entity surface form and its context can impact performance. Noisy scenarios are quite common in many applications such as Web search and social media. These challenges are described in Table 1, and our tracks are defined below.
1. Monolingual: NER systems are evaluated in a monolingual setting, i.e. models are trained and tested on the same language (12 tracks in total).

2. Multilingual: NER systems are tested on a multilingual test set composed from all languages in the monolingual tracks (1 track).

The task used the MULTICONER V2 dataset, whose 12 languages are used to define the 12 monolingual subsets of the task. Additionally, the dataset has a multilingual subset with mixed data from all the languages. MULTICONER 2 received 842 submissions from 47 teams, and 34 teams submitted system description papers. Results showed that the usage of external data and ensemble strategies played a key role in the strong performance. External knowledge brought large improvements on classes containing names of creative works and groups, allowing those systems to achieve the best overall results.
Regarding noisy data, all systems show significant performance drop on the noisy subset, which included simulated typographic errors. Small perturbations to entities had a more negative effect than those to the context tokens surrounding entities. This suggests that current systems may not be robust enough to handle real-world noisy data, and that further research is needed to improve their performance in such scenarios. Finally, NER systems seem to be most robust to noise for PER, while most susceptible to noise for GRP.
In terms of fine-grained named entity types, we observed that performance was lower than the coarse types due to failure to correctly disambiguate sub-classes such as ATHLETE vs. SPORTSMANAGER. Some of the most challenging fine-grained classes include PRIVATECORP, SCIENTIST and ARTWORK.

MULTICONER V2 Dataset
The MULTICONER V2 dataset was designed to address the NER challenges described in §1. The data comes from the wiki domain and includes 12 languages, plus a multilingual subset. Some examples from our data can be seen in Figure 1. For a detailed description of the MULTICONER V2 data, we refer the reader to the dataset paper (Fetahu et al., 2023). The dataset is publicly available at https://registry.opendata.aws/multiconer.

These languages were chosen to include a diverse typology of languages and writing systems, and range from well-resourced (EN) to low-resourced ones (FA). MULTICONER V2 contains 13 different subsets: 12 monolingual and a multilingual subset (denoted as MULTI).

Languages and Subsets
Monolingual Subsets Each of the 12 languages has its own subset.
Multilingual Subset This contains randomly sampled data from all the languages mixed into a single subset. This subset is designed for evaluating multilingual models, and should ideally be used under the assumption that the language of each sentence is unknown. From the test set of each language, we randomly selected at most 35,000 samples, resulting in a total of 358,668 instances.
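The per-language capped sampling could be sketched as follows (the function name and toy data are illustrative assumptions; this is not the official sampling script):

```python
import random

def build_multilingual_subset(test_sets, cap=35_000, seed=0):
    """Mix per-language test sets into one subset, sampling at most
    `cap` instances per language (illustrative sketch)."""
    rng = random.Random(seed)
    mixed = []
    for lang, samples in test_sets.items():
        k = min(cap, len(samples))
        mixed.extend(rng.sample(samples, k))
    rng.shuffle(mixed)  # so the language of each sentence is unknown
    return mixed

# Toy example: one large and one small test set
subsets = {"EN": list(range(40_000)), "FA": list(range(20_000))}
mixed = build_multilingual_subset(subsets, cap=35_000)
print(len(mixed))  # prints 55000: 35,000 capped EN + all 20,000 FA
```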

Dataset Creation
In this section, we provide a brief overview of the dataset construction process. Additional details are available in Fetahu et al. (2023).
MULTICONER V2 was built following the same strategy as Malmasi et al. (2022a): sentences from the different languages are extracted from localized versions of Wikipedia. We select low-context sentences, and the interlinked entities are resolved to entity types using Wikidata as a reference, according to the NER class taxonomy shown in Table 2. Furthermore, to prevent models from leveraging surface-form features, we lowercase the words and remove punctuation. These steps result in more challenging sentences that are more representative of real-world data.
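The casing and punctuation normalization can be sketched as follows (an illustrative approximation of the preprocessing described above, not the actual dataset pipeline):

```python
import string

def preprocess(tokens):
    """Lowercase tokens and drop punctuation-only tokens, mirroring
    the surface-form normalization described above (sketch)."""
    out = []
    for tok in tokens:
        if all(ch in string.punctuation for ch in tok):
            continue  # drop tokens consisting purely of punctuation
        out.append(tok.lower())
    return out

print(preprocess(["Dial", "M", "for", "Murder", ",", "1954", "."]))
# ['dial', 'm', 'for', 'murder', '1954']
```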

Fine-grained NER Taxonomy
MULTICONER 2 builds on the WNUT 2017 (Derczynski et al., 2017) taxonomy of entity types and adds an additional layer of fine-grained types. We drop the Corporation class, as it overlaps with the Group class. Furthermore, we introduce a new coarse-grained class called Medical, which captures entities from the medical domain (e.g. DISEASE, ANATOMICALSTRUCTURE, etc.). Table 2 shows the 33 fine-grained classes, grouped across 6 coarse types.
The fine-grained taxonomy allows us to capture a wide array of entities, including complex entity structures, such as CW, and entities that are ambiguous without their context, e.g. SCIENTIST vs. ATHLETE as part of the PER coarse-grained type.

Noisy Subsets
NER systems are typically trained on carefully curated datasets. However, in real-world scenarios, various errors may arise due to human mistakes. We applied noise only on the test set to simulate environments where NER models are exposed directly to user-generated content.
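One common typing error is substituting a letter with a neighboring key. A minimal sketch of such token-level corruption (the neighborhood map here is a toy assumption; the task used full language-specific keyboard layouts):

```python
import random

# Toy QWERTY neighborhood map (illustrative assumption only).
NEIGHBORS = {
    "a": "qwsz", "e": "wrd", "o": "ipl", "t": "ryg", "n": "bhm",
}

def corrupt_token(token, rng, p=0.3):
    """With probability p, replace one character of `token` by a
    keyboard neighbor, simulating a typing mistake."""
    if rng.random() > p:
        return token
    idxs = [i for i, ch in enumerate(token) if ch in NEIGHBORS]
    if not idxs:
        return token  # no corruptible character
    i = rng.choice(idxs)
    repl = rng.choice(NEIGHBORS[token[i]])
    return token[:i] + repl + token[i + 1:]

rng = random.Random(42)
print([corrupt_token(t, rng, p=1.0) for t in ["justice", "league"]])
```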
To evaluate the robustness of NER models, we corrupt 30% of the test set with various types of simulated errors in 7 languages (EN, ZH, IT, ES, FR, PT, SV). The corruption can impact context tokens and entity tokens. For Chinese, we applied character-level corruption strategies (Wang et al., 2018) that replace characters with visually or phonologically similar ones. For the other languages, we developed token-level corruption strategies based on common typing mistakes made by humans (e.g., randomly substituting a letter with a neighboring letter on the keyboard), utilizing language-specific keyboard layouts.


Dataset Statistics
Table 3 shows the MULTICONER V2 dataset statistics. For most tracks, we released 16k training and 800 development instances (with the exception of DE, BN, HI, and ZH, due to data scarcity). The test splits, on the other hand, are much larger. This is done for two reasons: (1) to assess the generalizability of NER models in identifying unseen and complex fine-grained entity types, where the entity overlap between train and test sets is small, and (2) to assess how models handle noise in contextual or entity tokens. For practical reasons, we cap the number of test instances at 250k per subset for most languages (with the exception of DE, BN, HI, and ZH, which are already small due to data scarcity).

Task Description and Evaluation
The shared task is composed of 12 monolingual tracks and 1 multilingual track. The multilingual track invited multilingual models capable of identifying entities in monolingual texts from any of the 12 languages. As described in Section 2.4, 30% of the test sets of the EN, ZH, IT, ES, FR, PT, and SV monolingual tracks are corrupted with simulated noise. We refer to the subsets with corruption as noisy subsets and to the rest as clean subsets.
For evaluation, we used macro-averaged F1 scores to evaluate and rank systems. The F1 scores are computed over the fine-grained types using exact matching (i.e. the entity boundary and type must exactly match the ground truth) and averaged across all types. We also report the performance on the noisy and clean subsets in Appendix A to study the impact of noise in §6.
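The ranking metric can be sketched as follows, treating gold and predicted entities as (span, type) pairs (an illustrative re-implementation, not the official scorer):

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Macro-averaged F1 over entity types with exact matching: a
    prediction counts only if both span boundaries and type match
    the gold annotation (sketch of the evaluation described above)."""
    types = {t for _, t in gold} | {t for _, t in pred}
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for span, typ in pred:
        if (span, typ) in gold:
            tp[typ] += 1
        else:
            fp[typ] += 1
    for span, typ in gold:
        if (span, typ) not in pred:
            fn[typ] += 1
    f1s = []
    for t in types:
        p = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        r = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

gold = {((0, 2), "VISUALWORK"), ((5, 6), "ARTIST")}
pred = {((0, 2), "VISUALWORK"), ((5, 6), "ATHLETE")}  # wrong fine type
print(round(macro_f1(gold, pred), 3))  # prints 0.333
```

Note that a correct span with the wrong fine-grained type counts as both a false positive and a false negative, which is exactly why fine-grained confusions such as ARTIST vs. ATHLETE are so costly under this metric.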

Baseline System
Similar to the 2022 edition (Malmasi et al., 2022b), we train and evaluate a baseline NER system using XLM-RoBERTa (XLM-R) (Conneau et al., 2020), a multilingual Transformer model. The XLM-R model computes a representation for each token, which is then used to predict the token tag using a CRF classification layer (Sutton et al., 2012).
XLM-R is suited for multilingual scenarios, supporting up to 100 languages, and provides a solid baseline upon which the participants can build. It was trained with a learning rate of 2e-5 for 50 epochs, with an early stopping criterion of a non-decreasing validation loss for 5 epochs. The code and scripts for the baseline system were provided to the participants, so that they could use its functionalities and further extend it with their approaches.

Table 2: The MULTICONER V2 taxonomy: 33 fine-grained classes grouped across 6 coarse types.
PER: ARTIST, ATHLETE, CLERIC, POLITICIAN, SCIENTIST, SPORTSMANAGER, OTHERPER
LOC: FACILITY, HUMANSETTLEMENT, STATION, OTHERLOC
GRP: AEROSPACEMANUFACTURER, CARMANUFACTURER, MUSICALGRP, ORG, PRIVATECORP, PUBLICCORP, SPORTSGRP
PROD: CLOTHING, DRINK, FOOD, VEHICLE, OTHERPROD
CW: ARTWORK, MUSICALWORK, SOFTWARE, VISUALWORK, WRITTENWORK
MED: ANATOMICALSTRUCTURE, DISEASE, MEDICALPROCEDURE, MEDICATION/VACCINE, SYMPTOM
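The CRF layer's decoding step can be illustrated with a small Viterbi implementation over toy emission and transition scores (a self-contained sketch, not the actual baseline code; the scores are hand-picked for illustration):

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence given per-token emission
    scores and tag-to-tag transition scores, as a CRF decoding layer
    does at inference time."""
    n_tags = len(transitions)
    score = list(emissions[0])  # best path score ending in each tag
    back = []                   # backpointers for path recovery
    for em in emissions[1:]:
        new_score, choices = [], []
        for t in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda s: score[s] + transitions[s][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + em[t])
            choices.append(best_prev)
        score, back = new_score, back + [choices]
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    return path[::-1]

# Tags: 0=O, 1=B-PER, 2=I-PER; the transition O -> I-PER is penalized,
# so the CRF enforces valid BIO tag sequences.
T = [[0.0, 0.0, -10.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
E = [[0.2, 1.0, 0.0], [0.1, 0.0, 0.9], [2.0, 0.1, 0.2]]
print(viterbi_decode(E, T))  # prints [1, 2, 0], i.e. B-PER I-PER O
```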

Participating Systems and Results
We received submissions from 47 different teams.

DAMO-NLP, the top-ranked system in most tracks (4th in HI), proposed a unified retrieval-augmented system (U-RaNER) for the task. The system uses two different knowledge sources (Wikipedia paragraphs and the Wikidata knowledge graph) to inject additional relevant knowledge into their NER model. Additionally, they explored an infusion approach to provide more extensive contextual knowledge about entities to the model.

PAI (Ma, 2023) ranked 1st in BN, DE, 2nd in FR, HI, IT, PT, 3rd in EN, 4th in ZH, 5th in MULTI, 7th in ES, FA, UK, and 8th in SV. They developed a knowledge base using entities and their associated properties like "instanceof", "subclassof", and "occupation" from Wikidata. For a given sentence, they used a retrieval module to gather different properties of the entities by string matching. They observed benefits on the clean subset through the dictionary-fusing approach; the same benefits were not observed on the noisy subset.

Another top-ranked system used a two-stage training strategy with a gazetteer-enhanced network: in the first stage, two networks are aligned at the representation level by minimizing the KL divergence between their representations; in the second stage, the two networks are trained together on the NER objective. The final predictions are derived from an ensemble of trained models. The results indicate that the gazetteer played a crucial role in accurately identifying complex entities during the NER process, and that the two-stage training strategy was effective.

NetEase.AI (Lu et al., 2023) ranked 1st in ZH. Their proposed system consists of multiple modules. First, a BERT model is used to correct any potential errors in the original input sentences. The NER module takes the corrected text as input and consists of a basic NER module and a gazetteer-enhanced NER module. This approach boosted the performance on entity-level noise and gave the system a strong advantage over the other teams (Table 11).
A retrieval system takes the candidate entity as input and retrieves additional context information, which is subsequently used as input to a text classification model to calculate the probability of the entity's type label. A stacking model is trained to output the final prediction based on the features from multiple modules.

CAIR-NLP (N et al., 2023) ranked 2nd in MULTI, ES, FA, SV, UK, 3rd in FR, IT, PT, 4th in EN, 7th in DE, 8th in HI, 10th in BN, and 13th in ZH. They developed a multi-objective joint learning system (MOJLS) that learns an enhanced representation of low-context and fine-grained entities. Their training procedure minimizes: 1) representation gaps between fine-grained entity types within a coarse-grained type, 2) representation gaps between an input sentence and the input augmented with external information for a given entity, 3) the negative log-likelihood loss, and 4) the biaffine-layer label prediction loss. Additionally, external context is retrieved via search engines for an input text, as well as ConceptNet data (Speer et al., 2016), to better represent an entity class with alternative names, aliases, and relation types to other concepts.
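Several of the systems above align two sets of representations by minimizing a KL divergence between them. For two discrete (softmax-normalized) score vectors, this quantity can be computed as follows (a generic sketch, not any team's actual code; the score vectors are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; an alignment stage
    as described above would minimize such a divergence between two
    networks' normalized representations."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = softmax([2.0, 0.5, 0.1])  # e.g. scores from one network
q = softmax([1.8, 0.6, 0.2])  # e.g. scores from the other network
print(kl_divergence(p, q))    # small positive value: already similar
print(kl_divergence(p, p))    # prints 0.0: identical distributions
```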
SRCB (Zhang et al., 2023b) ranked 3rd in ZH and 6th in EN. For an input sentence, the proposed approach retrieves external evidence from Wikidata and Wikipedia, which is concatenated with the original input using special tokens (e.g. "context", "prompt & description") to allow their models (based on Li et al. (2020)) to distinguish the different contexts. To retrieve the external context, the authors first detect entity mentions (Su et al., 2022) in the input sentence, then query the corresponding external sources.
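Retrieval-augmented systems like those above typically append the retrieved evidence to the input behind special separator tokens. A minimal sketch (the `[CONTEXT]` token and helper function are illustrative assumptions, not a specific team's implementation):

```python
def augment_input(sentence, contexts, sep="[CONTEXT]"):
    """Concatenate retrieved evidence to the input sentence behind a
    special separator token, in the spirit of the retrieval-augmented
    systems described above."""
    parts = [sentence]
    for ctx in contexts:
        parts.extend([sep, ctx])
    return " ".join(parts)

aug = augment_input(
    "dial m for murder premiered in 1954",
    ["Dial M for Murder is a 1954 thriller film."],
)
print(aug)
# dial m for murder premiered in 1954 [CONTEXT] Dial M for Murder is a 1954 thriller film.
```

In practice, the separator would be registered as a special token in the model's tokenizer so the encoder can distinguish the original sentence from the retrieved context.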
NLPeople (Elkaref et al., 2023) ranked 3rd in MULTI, 4th in FA, 5th in BN, DE, HI, UK, 6th in ES, SV, 7th in ZH, and 8th in EN, FR, IT, PT. They developed a two-stage approach: first they extract spans that can be entities, then they classify the spans into the most likely entity type. They augmented the training data with external context by adding relevant paragraphs, infoboxes, and titles from Wikipedia. On languages with smaller test sets, the infoboxes yielded better performance than adding relevant paragraphs.
IXA/Cogcomp (García-Ferrero et al., 2023) ranked 3rd in DE, HI, UK, SV, 4th in MULTI, BN, ES, 5th in PT, FA, FR, 6th in IT, 7th in EN, and 8th in ZH. They first trained an XLM-RoBERTa model for entity boundary detection, recognizing entities within the dataset and classifying them using the B-ENTITY and I-ENTITY tags. They employed a pre-trained mGENRE entity linking model to predict the corresponding Wikipedia title and Wikidata ID for each entity span based on its context. Then, they retrieved the "part of", "instance of", and "occupation" attributes and the article summary from Wikipedia. Finally, they trained a text classification model to categorize each entity boundary into a fine-grained category using the original sentence, the entity boundaries, and the external knowledge.
Samsung Research China (SRC) -Beijing (Zhang et al., 2023a) ranked 2 nd in EN. They fine-tuned a RoBERTa based ensemble system using a variant of dice loss (Li et al., 2019) to enhance the model's robustness on long tail entities. In their case dice loss uses soft probabilities over classes, to avoid the model overfitting on the more frequent classes. Additionally, a Wikipedia knowledge retrieval module was built to augment the sentences with Wikipedia passages.
Sakura (Poncelas et al., 2023) ranked 5th in ES, 6th in BN, DE, HI, UK, 7th in IT, SV, MULTI, 8th in FA, 9th in PT, ZH, and 11th in EN. They used mBART-50 (Tang et al., 2020) to translate data from a source language into the other target languages of the shared task. Then, they aligned the tokens using SimAlign (Jalili Sabet et al., 2020) to annotate the entity tokens in the target language. Using the translated examples, they increased the training data size by between 30K and 102K sentences depending on the language, providing a 1% increase in macro-F1.
KDDIE (Martin et al., 2023) ranked 5th in EN. Using a retrieval index based on Wikipedia, they enriched the original training data with additional sentences from Wikipedia. The data is used to train an ensemble of models, and the final NER prediction is based on votes from the different modules, such as BERT-CRF, RoBERTa, and DeBERTa.
MLlab4CS (Mukherjee et al., 2023) ranked 7 th in BN. MuRIL (Khanuja et al., 2021) was fine-tuned with an additional CRF layer used for decoding. MuRIL is specifically designed to deal with the linguistic characteristics of Indic languages.
CodeNLP (Marcińczuk and Walentynowicz, 2023) ranked 9th in MULTI and 13th in EN. mLUKE-large (Yamada et al., 2020) was fine-tuned using different data augmentation strategies, where multiple data instances are concatenated as a single input. Their experiments show that the NER model benefits from the additional context, even when the context was unrelated to the original sentence.
silp_nlp (Singh and Tiwary, 2023) ranked 7 th in HI, 9 th in BN, 10 th in DE, SV, 11 th in ES, UK, 12 th in FR, IT, PT, 17 th in ZH, 19 th in EN. Their model is trained in two stages. XLM-RoBERTa is first pre-trained using the multilingual set. Then, the checkpoint is fine-tuned for individual languages.
garNER (Hossain et al., 2023) ranked 8th in BN, 9th in ES, SV, UK, FA, HI, IT, 10th in PT, FR, ZH, 12th in DE, MULTI, and 16th in EN. The authors proposed an approach augmented with external knowledge from Wikipedia. For a given sentence and an entity, the Wikipedia API is called, and the retrieved result is concatenated with the sentence to provide additional context for token classification. The entities are extracted via spaCy for English; for other languages, XLM-RoBERTa is used to detect entities. The authors performed ablation studies to analyze the model performance and found that the relevance of the augmented context is a significant factor. Useful context can help the model identify some hard entities correctly, while irrelevant context can negatively affect the model's predictions.
Sartipi-Sedighin (Sartipi et al., 2023) ranked 8 th in UK, 10 th in FA, 11 th in BN, DE, IT, PT, MULTI, 12 th in ZH, 13 th in HI, SV, 14 th in EN, ES, and 16 th in FR. They used a data augmentation approach, where for entities in the training dataset, additional sentences from Wikipedia are retrieved. The retrieved sentences are used as additional context. Then, they experimented with Transformer based model variations fine-tuned on different languages. Data augmentation helped their model in certain classes, but negatively impacted some other classes by increasing false negatives, e.g. SYMPTOM.
MaChAmp (van der Goot, 2023) ranked 8th in MULTI. mLUKE-large (Yamada et al., 2020) was fine-tuned on data coming from all SemEval-2023 text-based tasks. For NER, a CRF decoding layer was used. For hyper-parameters, they relied on the MaChAmp toolkit (van der Goot et al., 2021). They also experimented with separate decoders for each language, using intermediate task pre-training with other SemEval tasks, but did not find it useful for further improvements.
D2KLab (Ehrhart et al., 2023) ranked 9th in DE, 10th in ES, IT, UK, 11th in FA, FR, 12th in HI, SV, 13th in PT, 14th in BN, MULTI, 16th in ZH, and 18th in EN. The T-NER library (Ushio and Camacho-Collados, 2021) was used to fine-tune a Transformer model. In addition to the data from MultiCoNER 2 and MultiCoNER, they used 10 other publicly available NER datasets.
ERTIM (Deturck et al., 2023) ranked 9th in FR, 11th in ZH, 12th in FA, and 20th in EN. They fine-tuned different models for the different languages, e.g. BERT, DistilBERT, CamemBERT, and XLM-RoBERTa. Additionally, each input sentence is enriched with relevant Wikipedia articles for additional context. Furthermore, they annotated a set of additional Farsi sentences extracted from news articles, which provided their system with an improvement of 4.2% in macro-F1 for FA.
RIGA (Mukans and Barzdins, 2023) ranked 12th in EN. The original data was augmented using GPT-3 to obtain additional context information, and XLM-RoBERTa (large) was then fine-tuned using the adapter fusion approach (Pfeiffer et al., 2021). The additional context extracted through GPT-3 provided a performance boost of 4% in macro-F1. The context is separated from the input sentence using a separator token.

LSJSP explored three approaches. The second approached the problem with a seq2seq framework: sentences, and statement templates filled by candidate named entity spans, are regarded as the source and target sequences. In the third approach, they transformed NER into a QA task, where a prompt is generated for each type of named entity. The third approach showed strong performance in recall, but overall performance was better using the stacked approach.
RGAT (Chakraborty, 2023) ranked 23rd in EN. They used dependency parse trees of sentences and encoded them using a graph attention network. The node representations were computed by taking into account the neighboring nodes and the dependency type. Additionally, they used features from BERT to make the final prediction for a token.
CLaC (Verma and Bergler, 2023) ranked 24 th in EN. They fine-tuned XLM-RoBERTa, finding that the span prediction approach is better than the sequence labeling approach.
Minanto (Höfer and Mottahedin, 2023) ranked 28 th in EN. XLM-RoBERTa was trained using the training data and a set of translated data from CoNLL 2003 and WNUT 2016 datasets.
Insights from the Systems

Integrating External Knowledge: To overcome the challenges of complex entities, unseen entities, and low context, the integration of external data was a common theme among the submitted systems, similar to the prior edition. However, this time we observed many new and diverse knowledge sources and novel ways to inject the data into the models for NER prediction. For example, apart from using paragraphs retrieved from Wikipedia using a search engine, participating teams used Wikidata, Wikipedia infoboxes, and ConceptNet. Some of these approaches used knowledge sources to compute better representations of the entity labels.
Multilingual Models: Most participants in the multilingual track opted to use the task's baseline model, XLM-RoBERTa. Additionally, some participants used mLUKE, mDeBERTa, and mBERT. In terms of external multilingual resources, participants mostly made use of Wikipedia.

Complex Entities: Our task includes several classes with complex entities such as media titles. The most challenging entities at the coarse level were from the PROD class, where the average macro-F1 score across all participants was 0.68. This class contains challenging entities with highly complex and ambiguous surface forms, such as CLOTHING, where the average across all participants was macro-F1=0.58. There is high variation among participants on the challenging coarse types, such as PROD. For instance, for EN the top-ranked system, DAMO-NLP, achieves an F1 of 0.88, while the lowest-ranking system, IXA, achieves an F1 of 0.21. This is highly related to whether the systems used external knowledge.

Figure 2 shows a confusion matrix of coarse-grained performance. We note that PROD, MED, and CW have low recall, with more than 25% of the entities not being identified correctly. GRP is misclassified in 4.2% of the cases as other types such as LOC or CW, highlighting the surface-form ambiguity of this type. On the other hand, PER obtains the highest score with 93.7%, yet at the fine-grained level there is often confusion among the different PER fine-grained types.

Impact of Fine-grained Classes: For coarse types such as PER, participants obtain very high scores, e.g. DAMO-NLP obtains an F1 of 0.97 on the noise-free test set. However, if we inspect the performance at the fine-grained level, we notice high variance. For instance, SCIENTIST and OTHERPER obtain significantly lower scores, with F1 scores of 0.70. This gap provides two main insights. First, while the PER class is often very easy to spot, distinguishing the more fine-grained types is much more challenging given their high ambiguity.
Second, for fine-grained NER, capturing context is important. We see that for a class like SCIENTIST, whose entities often appear in scientific reporting contexts (e.g. research breakthroughs), pre-trained Transformer models often confuse such entities with ARTIST or POLITICIAN, for which such models have much more pretrained knowledge. Appendix B provides an in-depth error analysis at the fine-grained entity type level for all coarse-grained types.
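Per-class recall, as discussed for the confusion matrices above, is the diagonal count divided by the row sum. A small sketch (the matrix values are toy numbers, not actual task results):

```python
def per_class_recall(confusion, labels):
    """Recall per class from a confusion matrix whose rows are gold
    labels and columns are predicted labels."""
    recalls = {}
    for i, label in enumerate(labels):
        row = confusion[i]
        total = sum(row)  # all gold instances of this class
        recalls[label] = row[i] / total if total else 0.0
    return recalls

labels = ["PER", "PROD", "CW"]
conf = [
    [937, 3, 10],    # gold PER
    [40, 680, 50],   # gold PROD
    [30, 45, 700],   # gold CW
]
print(per_class_recall(conf, labels))  # PER has the highest recall
```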
Impact of Noise: Evaluation on the noisy subsets shows that most of the participants were impacted significantly. Comparing the macro-F1 on the noisy and clean subsets, we notice that across all participants and languages there is an average performance drop of 10%. The largest impact is observed for ZH, where the gap can be as high as 48% macro-F1. Finally, we note that noise is mostly harmful when it affects named entity tokens, while noise on other tokens has a minor impact on NER performance. Across all participants and languages, the average performance dropped by 11.1% when corruption was applied to entity tokens and by 4.3% when it was applied to context tokens.

Conclusion
We presented an overview of the SemEval shared task on identifying complex entities in multiple languages. We received system submissions from 47 teams, and 34 system papers. On average, the winning systems for all tracks outperformed the baseline system by a large margin of 35% F1.
All top-performing teams in MULTICONER 2 utilized external knowledge bases like Wikipedia and gazetteers to provide additional context. We also observed systems that provided information about the entity classes to help models learn the definition of each entity type. In terms of modeling, ensemble strategies helped the systems achieve strong performance. Finally, the impact of noise was significant for all submitted systems, with the macro-F1 dropping significantly between the noisy and clean subsets of the test data.

A Detailed Results for Noisy Test Sets
In this section, we provide detailed performance for the subset of the monolingual tracks that contain a noisy test subset. For each team, we report the F1 scores on the clean subset and on the subsets with entity-level and context-level noise.
• Table 5: English (EN)
• Table 6: Italian (IT)
• Table 7: Spanish (ES)
• Table 8: French (FR)
• Table 9: Portuguese (PT)
• Table 10: Swedish (SV)
• Table 11: Chinese (ZH)


B Fine-grained Error Analysis

Figure 3 shows the misclassifications across the different fine-grained types for the baseline approach on the EN test set. An ideal classifier would have 100% performance on the diagonal.

CW.
For this class, the baseline has low recall, with many of the entities being missed (O tag). In terms of misclassifying the fine-grained types, we note that the highest confusion is between MUSICALWORK and VISUALWORK, with 7.4% false positives.

GRP.
In the case of GRP, we notice high confusion between ORG, PUBLICCORP, and PRIVATECORP, with error rates of up to 26.3%. This highlights the difficulty of the different fine-grained classes, where capturing context is important. Even more importantly, in this particular problem of fine-grained NER, external or world knowledge of entities is crucial: knowledge about the different corporations may be necessary to correctly distinguish between such fine-grained named entity types.
LOC. For this class, most of the errors are between FACILITY and OTHERLOC.

PER.
In the case of PER, SPORTSMANAGER is confused with ATHLETE in 41.2% of the cases (many sports managers are former athletes). The PER coarse type is highly challenging for some of the fine-grained types, given that the surface forms can be highly ambiguous, and only the context can differentiate between the different types (ATHLETE, SCIENTIST, ARTIST, etc.).

MED.
In this case, we notice high confusion between DISEASE and SYMPTOM, at 21.6%. This is an interesting insight, given that names for diseases and symptoms are often used interchangeably (i.e., a symptom may cause a disease that is referred to by the same name).

PROD.
Here we notice that DRINK and FOOD are often confused with each other, at 10.7%. This highlights ambiguous cases where an item may be considered both, e.g. milk. Finally, most misclassifications happen between VEHICLE and OTHERPROD. A potential cause is the lack of detailed type assignments for entities in Wikidata: OTHERPROD entities may actually belong to VEHICLE, but are not explicitly associated with that type in Wikidata.