WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types

Multimodal Entity Linking (MEL), which aims at linking mentions with multimodal contexts to their referent entities in a knowledge base (e.g., Wikipedia), is an essential task for many multimodal applications. Although much attention has been paid to MEL, the shortcomings of existing MEL datasets, including limited contextual topics and entity types, simplified mention ambiguity, and restricted availability, pose great obstacles to the research and application of MEL. In this paper, we present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types built from Wikinews, which uses Wikipedia as the corresponding knowledge base. A well-tailored annotation procedure is adopted to ensure the quality of the dataset. Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions are implemented, which utilize the visual information of images more adequately than existing MEL models do. Extensive experimental analyses are conducted to investigate the contributions of different modalities to MEL, facilitating future research on this task.


Introduction
Entity linking (EL), which aims at linking ambiguous mentions to their referent unambiguous entities in a given knowledge base (KB) (Shen et al., 2014), has attracted increasing attention in the natural language processing community. It has been applied to many downstream tasks such as information extraction (Yaghoobzadeh et al., 2016), question answering (Yih et al., 2015) and semantic search (Blanco et al., 2015).
As named entities (i.e., mentions) with multimodal contexts such as texts and images are ubiquitous in daily life, recent studies (Moon et al., 2018; Adjali et al., 2020a) have turned their focus towards improving the performance of EL models by utilizing visual information, i.e., Multimodal Entity Linking (MEL) 1 . Several MEL examples are depicted in Figure 1, where the images effectively help disambiguate entity mentions of different types. Due to its importance to many multimodal understanding tasks, including VQA, multimodal retrieval, and the construction of multimodal KBs, much effort has been dedicated to the research of MEL. Moon et al. (2018) first addressed the MEL task under the zero-shot setting. Adjali et al. (2020a) designed a model to combine visual, textual and statistical information for MEL. Zhang et al. (2021) designed a two-stage mechanism that first determines the relations between images and texts to remove the negative impact of noisy images and then performs the disambiguation. Gan et al. (2021) first disambiguated visual mentions and textual mentions respectively, and then used graph matching to explore possible relations among inter-modal mentions.

Table 1:
… (Milne and Witten, 2008) | News | Wikipedia | Tm → Te | Multiple | Multiple | en | 50 docs
ACE2004 (Ratinov et al., 2011) | News | Wikipedia | Tm → Te | Multiple | Multiple | en | 57 docs
CWEB (Guo and Barbosa, 2018) | Web | Wikipedia | Tm → Te | Multiple | Multiple | en | 320 docs
WIKI (Guo and Barbosa, 2018) | Wiki | …
… (Moon et al., 2018) | Social Media | Freebase | Tm, Vm → Te | Multiple | Multiple | en | 12K captions
Twitter (Adjali et al., 2020a) | Social Media | Twitter users | Tm, Vm → Te, Ve | Multiple | PER, ORG | en | 4M tweets
Movie (Gan et al., 2021) | Movie Reviews | Wikipedia | Tm, Vm → Te, Ve | Movie | PER | en | 1K reviews
Weibo (Zhang et al., 2021) | Social …
Although much attention has been paid to MEL, the existing MEL datasets, as listed in the middle rows of Table 1, have deficiencies in the following aspects, which hinder further advancement of MEL research and applications.
• Limited Contextual Topics. As shown in Figure 2(a), the existing MEL datasets are mainly collected from social media or movie reviews, which cover only 5 topics in the social media domain and 1 topic in the movie review domain. In the news domain, by contrast, we observe more than 10 topics, including other popular topics like disaster and education. The lack of topics limits the generalization ability of MEL models.
• Limited Entity Types. Entities in the existing MEL datasets mainly belong to the types of "person (PER)" and "organization (ORG)". This restricts the application of MEL models to other entity types such as locations and events, which are also ubiquitous in common application scenarios.
• Simplified Mention Ambiguity. Some datasets such as Twitter (Adjali et al., 2020a) create artificially ambiguous mentions by replacing the original entity names with the surnames of persons or acronyms of organizations. Besides, limited entity types also lead to limited mention ambiguity, which only occurs with PER and ORG. According to our statistics over different domains, as depicted in Figure 2(b), there are overall ten kinds of mention ambiguity in the news domain (e.g., Wikinews 2 ), while existing datasets collected from social media or movie reviews only cover a small scope of ambiguity.
• Restricted Availability. Most of the existing MEL datasets are not publicly available.
To enable more detailed research of MEL, we propose a manually-annotated MEL dataset named WIKIDiverse with multiple topics and multiple entity types. It consists of 8K image-caption pairs collected from Wikinews and uses Wikipedia, with ~16M entities in total, as the KB. Both the mentions and entities are characterized by multimodal contexts. We design a well-tailored annotation procedure to ensure the quality of WIKIDiverse and analyze the dataset from multiple perspectives (Section 4). Based on WIKIDiverse, we propose a sequence of MEL models with intra-modality and inter-modality attentions, which utilize the visual information of images more adequately than the existing MEL models (Section 5). Furthermore, extensive empirical experiments are conducted to analyze the contributions of different modalities to the MEL task and the visual clues provided by the visual contexts (Section 6). In summary, the contributions of our work are as follows:
• We present a new manually-annotated high-quality MEL dataset that covers diversified topics and entity types.
• Multiple well-designed MEL models with intra-modal and inter-modal attention are presented, which utilize the visual information of images more adequately than previous MEL models.
• Extensive empirical results quantitatively show the roles of the textual and visual modalities in MEL, and detailed analyses point out promising directions for future research.

Related Work
Textual EL There is vast prior research on textual entity linking. Multiple datasets have been proposed over the years, including manually-annotated high-quality datasets like AIDA (Hoffart et al., 2011) and automatically-annotated large-scale datasets like CWEB (Guo and Barbosa, 2018; Milne and Witten, 2008), etc. However, as mentioned in (Cao et al., 2021), many methods have achieved high and similar results in the recent three years. One possible explanation is that performance may simply be near the ceiling of what can be achieved on these datasets, making it difficult to conduct further research based on them.
Multimodal EL In recent years, the growing trend towards multimodality calls for extending the research of EL from monomodality to multimodality. Moon et al. (2018) first address the MEL task and build a zero-shot framework, which extracts textual, visual and lexical information for EL in social media posts. However, its proposed dataset is unavailable due to GDPR rules. Adjali et al. (2020a,b) propose a framework for automatically building an MEL dataset from Twitter. The dataset has limited entity types and limited mention ambiguity, and is thus not challenging enough. Zhang et al. (2021) study a Chinese MEL dataset collected from the Chinese social media platform Weibo, which mainly focuses on person entities. Gan et al. (2021) release an MEL dataset collected from movie reviews and propose to disambiguate both visual and textual mentions; this dataset mainly focuses on characters and persons of the movie domain. Peng (2021) proposes three MEL datasets, built from Weibo, Wikipedia, and Richpedia, which use CNDBpedia, Wikidata and Richpedia as the corresponding KBs. However, using Wikipedia as the target dataset may lead to a data leakage problem, as many language models are pretrained on it.
Our MEL dataset is also related to other named-entity-related multimodal datasets, including entity-aware image caption datasets (Biten et al., 2019; Tran et al., 2020), multimodal NER datasets (Lu et al., 2018), etc. However, the entities in these datasets are not linked to a unified KB, so our research on MEL can enhance the understanding of named entities and thereby advance research in these areas.

Problem Formulation
Multimodal entity linking is defined as mapping a mention with multimodal contexts to its referent entity in a pre-defined multimodal KB. Since the boundary and granularity of mentions may be controversial, the mention span is usually pre-specified. Here we assume each mention has a corresponding entity in the KB, i.e., the in-KB evaluation setting.
Formally, let E represent the entity set of the KB, which usually contains millions of entities. Each mention m or entity e_i ∈ E is characterized by its visual context V_m or V_{e_i} and its textual context T_m or T_{e_i}. Here T_m and T_{e_i} represent the textual spans around m and e_i respectively; V_m is the image correlated with m, and V_{e_i} is the image of e_i in the KB. In real life, entities in KBs may contain more than one image. To simplify, we select the first image of e_i as V_{e_i} and leave MEL with multiple images per entity as future work. The referent entity of mention m is then predicted through:

e* = argmax_{e_i ∈ E} Ψ(m; e_i),

where Ψ(·) represents the similarity score between the mention and the entity.
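As a minimal illustration of this formulation (a toy sketch, not the paper's implementation; the entity names and scores below are hypothetical), the prediction rule is just an argmax of Ψ over the entity set:

```python
def link(mention, entities, psi):
    """Return the referent entity e* that maximizes the score psi(mention, e)."""
    return max(entities, key=lambda e: psi(mention, e))

# Toy usage: psi is a lookup of precomputed similarity scores
# (hypothetical values for the ambiguous mention "Japan").
scores = {"Japan_national_football_team": 0.91, "Japan": 0.55}
best = link("Japan", list(scores), lambda m, e: scores[e])
```

In the real task, `psi` would be the learned multimodal similarity model described in Section 5.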

Dataset Construction
In this section, we present the dataset construction procedure. Many factors, including annotation quality, coverage of topics, diversity of entity types, and coverage of ambiguity, are taken into consideration to ensure the research value of WIKIDiverse.

Data Collection
Data Source Selection 1) For the source of image-text pairs, considering that news articles are widely studied in traditional EL (Hoffart et al., 2011; Cucerzan, 2007) and usually cover a wide range of topics and entity types, we decide to use news articles. Wikinews and BBC are two popular sources of news articles, so we compared them on two aspects. As shown in Table 2, Wikinews has advantages in terms of the alignment degree between image-text pairs and the MEL difficulty, so we select the image-caption pairs of Wikinews to build the corpus. 2) For the KB, we use the commonly-used Wikipedia (Hoffart et al., 2011; Ratinov et al., 2011; Guo and Barbosa, 2018). We also provide the annotation of the corresponding Wikidata entity for flexible studies.
Data Acquisition 1) For the image-caption pairs, we collect all the English news from 2007 to 2020 from Wikinews, covering multiple topics including sports, politics, entertainment, disaster, technology, crime, economy, education, health and weather. The data cover most of the common topics in the real world. Finally, we obtain a raw corpus of 14k image-caption pairs. 2) For the KB, we use Wikipedia 3 . The entity set consists of all entities in the main namespace, with a size of ~16M.
Data Cleaning For the image-caption pairs, we remove the cases that 1) contain pornographic, profane, or violent content, or 2) whose text is shorter than 3 words. Finally, we get a corpus of 8K image-caption pairs.

Annotation
Annotation Design The primary goal of WIKIDiverse is to link mentions with multimodal contexts to the corresponding Wikipedia entity. Therefore, given an image-text pair, annotators need to 1) detect mentions from the text (Mention Detection, MD) and 2) label each detected mention with the corresponding entity in the form of a Wikipedia URL (Entity Linking, EL). Mentions that do not have corresponding entities in Wikipedia are labeled "NIL". Seven common entity types (i.e., Person, Organization, Location, Country, Event, Works, Misc) are required to be annotated. To avoid subjective errors, we design detailed annotation guidelines with multiple samples to avoid controversy over mention boundaries, mention granularity, entity URLs, etc. Details can be found in the Appendix. We also hold regular communications to discuss emerging annotation problems.
Annotation Procedure The annotation team consists of 13 annotators and 2 experienced experts. All annotators have linguistic knowledge and are instructed with detailed annotation principles. Each image-caption pair is independently annotated by two annotators, and an experienced expert then goes over the annotations to resolve disagreements.

Analysis of WIKIDiverse
Size and Distribution of WIKIDiverse We divide WIKIDiverse into training, validation, and test sets with a ratio of 8:1:1. The statistics of WIKIDiverse are shown in Table 3. The collected Wikipedia KB has ~16M entities in total (i.e., |E| ≈ 16M). Besides, we report the entity type distribution in Figure 4(a) and the topic distribution in Figure 2(a).
Difficulty Measure First, we compare the surface forms of mentions and their ground-truth entities: 51.31% of the mentions have surface forms that differ from their ground-truth entities, and 16.05% are totally different. Such large surface-form differences make MEL challenging.
Second, we report the number of candidate entities for each mention in Figure 4(b). Intuitively, the more entities a mention may refer to, the more ambiguous the mention is, and the more difficult EL/MEL is. Specifically, we generate an m → e hash list based on the (m, e) co-occurrence statistics from Wikipedia (see Section 5.1 for details). As shown in Figure 4(b), 1) 44.2% of mentions have more than 10 candidate entities, and 2) 16.7% of mentions are not contained in the hash list, which means their candidates are the entire entity set of the KB.
Third, we randomly sample 200 image-caption pairs from WIKIDiverse to evaluate the diversity of ambiguity. As shown in Figure 2(b), WIKIDiverse covers a wide range of ambiguity.

Methods
It is challenging to directly predict the entity from a large-scale KB because it consumes large amounts of time and space. Therefore, following previous work (Yamada et al., 2016; Ganea and Hofmann, 2017; Cao et al., 2021), we split MEL into two steps: 1) candidate retrieval (CR) first guarantees recall by obtaining a candidate entity set consisting of the TopK entities most similar to the mention; 2) entity disambiguation (ED) then guarantees precision by predicting the candidate entity with the highest matching score.

Candidate Retrieval
Existing methods (Yamada et al., 2016; Ganea and Hofmann, 2017; Le and Titov, 2018) mainly utilize two types of clues to generate the candidate entity set E_m: (I) the m → e hash list recording prior probabilities from mentions to entities, P(e|m); (II) the similarity between the contexts of mention m and entity e.
Following these works, we implement a series of baselines as follows: (I) P(e|m) (Ganea and Hofmann, 2017): P(e|m) is calculated based on 1) mention-entity hyperlink count statistics from Wikipedia, 2) Wikipedia redirect pages, and 3) Wikipedia disambiguation pages. (II) Baselines of textual modality: we retrieve the TopK candidate entities whose textual contexts are most similar to that of the mention.
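A toy sketch (under the assumption that anchor-text statistics are available as (mention, entity) pairs; not the authors' code) of how such a P(e|m) hash list and its TopK lookup could be built:

```python
from collections import Counter, defaultdict

def build_prior(pairs):
    """Estimate P(e|m) from (mention, entity) co-occurrence pairs,
    e.g. hyperlink anchor statistics mined from Wikipedia."""
    counts = defaultdict(Counter)
    for mention, entity in pairs:
        counts[mention][entity] += 1
    return {m: {e: c / sum(ctr.values()) for e, c in ctr.items()}
            for m, ctr in counts.items()}

def top_k(prior, mention, k):
    """TopK candidate entities for a mention, ranked by P(e|m).
    Mentions absent from the hash list get an empty candidate set."""
    cands = prior.get(mention, {})
    return [e for e, _ in sorted(cands.items(), key=lambda x: -x[1])[:k]]

# Hypothetical anchor statistics.
pairs = [("Japan", "Japan"), ("Japan", "Japan"),
         ("Japan", "Japan_national_football_team")]
prior = build_prior(pairs)
```

Mentions missing from the hash list (16.7% in WIKIDiverse) fall back to the entire entity set, which is what makes the dataset challenging for prior-based retrieval.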

Contrastive Entity Disambiguation
The interaction between the multimodal contexts of mentions and entities is complicated and may introduce noise into the model without careful handling. So we also introduce several baselines to explore the fusion of multimodal information.
The key component of ED is to design the function Ψ(m; e_i) that quantifies the matching score between the mention m and every entity e_i ∈ E_m. As shown in Figure 5, the backbone of Ψ(m; e_i) consists of multimodal encoders for m and e_i respectively, followed by a dot product to evaluate the matching degree between them; a multi-layer perceptron (MLP) is then used to combine the result with P(e|m). Formally, the referent entity e* of m is predicted through:

e* = argmax_{e_i ∈ E_m} Ψ(m; e_i).   (1)

The multimodal encoders of mentions and entities are thus the most significant parts of MEL. They use the same structure but are trained with different parameters.
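As a toy sketch of this scoring scheme (a fixed linear combination stands in for the learned MLP, and the vectors are hypothetical encoder outputs):

```python
def dot(u, v):
    """Dot product of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def psi(enc_m, enc_e, prior, w_sim=1.0, w_prior=1.0):
    """Matching score: encoder dot product combined with the prior P(e|m).
    The learned MLP combiner is approximated here by fixed weights."""
    return w_sim * dot(enc_m, enc_e) + w_prior * prior

# Hypothetical unit-norm-ish encoder outputs and a prior of 0.3.
score = psi([0.6, 0.8], [0.6, 0.8], prior=0.3)
```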
Multimodal Encoder First, we get the textual context's embeddings. For the mention's textual context T_m = {w_1, . . . , w_{L_1}}, we directly embed it with the word embedding layer of BERT. For e_i, we use the pre-trained embeddings from Yamada et al. (2020), which compress the semantics of e_i's entire contexts from Wikipedia.
Second, we get the visual context's embeddings. Instead of the widely used region-based visual features, we adopt grid features following (Huang et al., 2020), which have the advantage of end-to-end training. Specifically, the visual features {v_1, . . . , v_{L_2}} are obtained by flattening the backbone's feature map along the spatial dimension, denoted Flat(·), where L_2 indicates the number of grid features. Finally, taking the embeddings of the two modalities as inputs, we capture the interaction between them. We adopt several backbones to fuse the modalities: 1) UNITER: the two modalities are concatenated and fed into self-attention transformers to fuse them together. 2) UNITER*: we apply separate self-attention transformers to the two modalities before UNITER for better feature extraction of each modality. 3) LXMERT (Tan and Bansal, 2019): the two modalities are first fed into separate self-attention transformers and then interact through cross-modal attention. The design of intra-modal and inter-modal attention helps better alignment and interaction of the modalities.
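As a toy illustration of the spatial flattening step (pure-Python lists stand in for the actual feature tensors; the shapes below are illustrative):

```python
def flatten_grid(grid):
    """Flat(.): flatten an H x W grid of D-dimensional feature vectors
    along the spatial dimensions into L2 = H * W vectors."""
    return [cell for row in grid for cell in row]

grid = [[[0.1, 0.2], [0.3, 0.4]],
        [[0.5, 0.6], [0.7, 0.8]]]  # H = W = 2, D = 2
feats = flatten_grid(grid)         # L2 = 4 feature vectors
```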
After multiple layers of the fusion operation Fuse({ŵ_1, . . . , ŵ_{L_1}}, {v̂_1, . . . , v̂_{L_2}}), the hidden states of the mention's tokens {h_i, . . . , h_j} are obtained. We then concatenate the hidden states of the first and last tokens and feed them into an MLP to get the mention's embedding.

Contrastive Loss We introduce contrastive learning (Karpukhin et al., 2020; Gao et al., 2021) to learn more robust representations of both mentions and entities. It is widely acknowledged that selecting negative examples can be decisive for learning a good model. To this end, we utilize both hard negatives and in-batch negatives to improve the model's ability to distinguish gold entities from hard/general negatives. Let e_{i,j} represent the j-th candidate entity of the i-th mention in a batch, and let P_i denote the index of m_i's gold entity. The hard negatives are the other K-1 candidate entities retrieved in the CR step except for the gold entity, i.e., e_{i,k} with k ≠ P_i. The in-batch negatives are the gold entities of the other B-1 mentions in the mini-batch, where B represents the batch size. The optimization objective is defined as the negative log-likelihood of the ground-truth entity:

L = -log ( exp(Ψ(m_i; e_{i,P_i})) / ( exp(Ψ(m_i; e_{i,P_i})) + Σ_{k ≠ P_i} exp(Ψ(m_i; e_{i,k})) + Σ_{j ≠ i} exp(Ψ(m_i; e_{j,P_j})) ) )   (4)

Besides the above baselines, we also compare with the following classic baselines: 1) baselines of textual modality, including REL (Le and Titov, 2018), BERT, and BLINK; 2) baselines of visual modality, including ResNet-50 and CLIP; 3) multimodal baselines, including MMEL18 (Moon et al., 2018) and MMEL20 (Adjali et al., 2020b). Details of the baselines can be found in the Appendix.

Table 4: Performance of candidate retrieval. R@K represents the recall of the TopK retrieved entities. The modalities P, T and V represent P(e|m), the textual context and the visual context respectively; T+V and P+T+V represent ensembles of the different sub-methods. Results with * are generated using grid search over the Dev set to find the best combination of sub-methods.
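A minimal pure-Python sketch of this contrastive objective (a stand-in for the actual training code; the gold, hard-negative and in-batch-negative scores are raw Ψ values passed in separately):

```python
import math

def contrastive_loss(gold, hard_negs, in_batch_negs):
    """Negative log-likelihood of the gold entity against hard negatives
    (other retrieved candidates) and in-batch negatives (other mentions'
    gold entities), i.e. -log softmax over all the scores."""
    logits = [gold] + list(hard_negs) + list(in_batch_negs)
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - gold

# When the gold score dominates all negatives, the loss approaches 0.
loss = contrastive_loss(10.0, [0.0, -1.0], [0.5])
```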
Experimental Results

Table 5: Comparison with baselines, with results averaged over 5 runs. Models marked † are enhanced with contrastive learning. All models use the same candidate entity set retrieved through P(e|m)+BLINK+CLIP with K = 10.

As shown in Table 4: 1) for retrieval, each mention takes about 12ms with P(e|m), 40ms with BM25, 183ms with WikiVec and CLIP, and 60ms with BLINK; 2) as for the ensemble of different modalities, T+V achieves better results than V or T alone, which verifies that the information of the different modalities is complementary. In practice, we use grid search over the Dev set to find the best combination of the different modalities; for example, when K = 10, the best E_m is generated with 80% P + 10% T + 10% V.
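The quota-style combination can be sketched as follows (a toy stand-in for the actual grid-searched ensemble; entity names are hypothetical):

```python
def merge_candidates(ranked_lists, weights, k):
    """Fill a K-slot candidate set with a quota from each sub-method's
    ranking, e.g. K = 10 with weights (0.8, 0.1, 0.1) for P(e|m),
    textual retrieval and visual retrieval. Duplicates are skipped."""
    merged, seen = [], set()
    for ranked, w in zip(ranked_lists, weights):
        quota, taken = round(w * k), 0
        for e in ranked:
            if taken >= quota or len(merged) >= k:
                break
            if e not in seen:
                seen.add(e)
                merged.append(e)
                taken += 1
    return merged

cands = merge_candidates([["e1", "e2", "e3"], ["e2", "e4"], ["e5"]],
                         weights=(0.5, 0.25, 0.25), k=4)
```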

Entity Disambiguation Results
Following previous work, we report micro F1, precision, and recall in Table 5. From the experimental results, we can see that: First, the proposed multimodal methods outperform all single-modality methods, benefiting from the multimodal contexts; moreover, contrastive learning further improves the performance, as we reckon it strengthens the ability to distinguish entities. Second, the textual baselines perform better than the visual ones, which indicates that the textual context still plays a dominant role in MEL. Third, the methods using transformers to model the interaction between modalities perform better than those with simple interactions (Moon et al., 2018; Adjali et al., 2020a), which verifies the importance of fusing the different modalities.

Multimodal Analysis
We also conduct several experiments on the ED task, as follows.

Are the multiple modalities complementary?
We draw a Venn diagram of the different modalities in Figure 8. The circle of Method_i is computed as #Hit_i / |Dataset|, and the intersection of two circles as #(Hit_i ∩ Hit_j) / |Dataset|. One can see that the textual modality is dominant, while the visual modality provides complementary information. In particular, the multimodal method correctly predicts an additional 9.38% of entities, which verifies the importance of fusing the two modalities.
Is it better to have multimodal contexts for both mentions and entities? We conduct an ablation study and report the experimental results in Table 6. The model with multimodal contexts for both mentions and entities achieves the best result, so linking multimodal mentions to multimodal entities is better than linking multimodal mentions to mono-modal entities as done in (Moon et al., 2018).
What visual clues are provided by the visual contexts? We randomly select 800 image-caption pairs from the test dataset, and then ask annotators to label each mention with the types of visual clues. The visual clues include 4 types: 1) Object: the image contains the entity object. 2) Scene: the image reveals the scene that the entity belongs to (e.g., a basketball player for the "basketball game" scene). 3) Property: the image contains some properties of the entity (e.g., an American flag reveals the property of a person's nationality). 4) Others: other important contexts. Note that the four types of clues may overlap, and a sample may have no clues. Examples of the visual clues can be found in Figure 6. We find that visual context is helpful for 60.54% of mentions and 81.56% of image-caption pairs. We report the contribution of different types of visual clues in Table 7. One can see that: 1) for scene, object, and property clues, T+V significantly outperforms T, which demonstrates that the multimodal model benefits from multiple types of visual clues in the images; 2) but our model still does not perform well with scene and property clues, so fine-grained visual clues are not yet well exploited, which indicates a direction for future research.

Table 6: Ablation study analyzing modality absence for mentions and entities. W/o T_m/e or V_m/e stands for LXMERT trained without the corresponding inputs.

Figure 8: Venn diagram illustration of the contributions of different modalities. We remove the input of the corresponding modality from LXMERT to get the results without re-training the model. To avoid interference from P(e|m), we also remove it from the model.

Case Study
We present several examples where multimodal contexts influence MEL in Figure 7. Examples (a) and (b) verify the helpfulness of the multimodal context. From the error cases, we can see that the model still lacks the following capabilities: 1) eliminating the influence of unhelpful images (e.g., Example (c)); 2) performing reasoning (e.g., inferring the "white house" from Example (d)'s image); 3) alleviating over-reliance on P(e|m) (e.g., Example (e)).

Conclusion and Future Work
We propose WIKIDiverse, a manually-annotated Wikipedia-based MEL dataset collected from Wikinews. To overcome the weaknesses of existing datasets, WIKIDiverse covers a wide range of topics, entity types and ambiguity. We implement a series of baselines and carry out multiple experiments over the dataset. According to the experimental results, WIKIDiverse is a challenging dataset worth further exploration. Besides multimodal entity linking, WIKIDiverse can also be used to evaluate pre-trained language models, multimodal named entity typing/recognition, multimodal topic classification, etc. In the future, we plan to 1) utilize more than one image per entity, 2) adopt finer-grained multimodal interaction models for this task, and 3) transfer the model to more general scenarios such as EL in full articles.