Named Entity and Relation Extraction with Multi-Modal Retrieval

Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE. Most existing efforts have largely focused on directly extracting potentially useful information from images (such as pixel-level features, identified objects, and associated captions). However, such extraction processes may not be knowledge aware, resulting in information that may not be highly relevant. In this paper, we propose a novel Multi-modal Retrieval based framework (MoRe). MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from the knowledge corpus, respectively. Next, the retrieval results are sent to the textual and visual models respectively for predictions. Finally, a Mixture of Experts (MoE) module combines the predictions from the two models to make the final decision. Our experiments show that both our textual model and visual model can achieve state-of-the-art performance on four multi-modal NER datasets and one multi-modal RE dataset. With MoE, the model performance can be further improved, and our analysis demonstrates the benefits of integrating both textual and visual cues for such tasks.


Introduction
Utilizing images to improve the performance of Named Entity Recognition (NER) and Relation Extraction (RE) has attracted increasing attention in natural language processing. Image information can be utilized in various domains such as social media (Zhang et al., 2018b; Moon et al., 2018; Lu et al., 2018; Zheng et al., 2021b), movie reviews (Gan et al., 2021) and news (Wang et al., 2022d). Most previous approaches utilize images by extracting information such as feature representations (Moon et al., 2018), object tags (Zheng et al., 2021a) and image captions to improve model performance. However, most image feature, object tag and caption extractors are trained on datasets such as ImageNet (Deng et al., 2009) and Visual Genome (Krishna et al., 2016), which mainly contain common nouns instead of named entities. As a result, the extractors (especially those for object tags and image captions) often output information about common nouns, which may not be helpful for entity-based task models. Recently, pretrained vision-language models (Tan and Bansal, 2019b) have significantly improved the performance of cross-modal tasks such as VQA (Agrawal et al., 2015), NLVR (Suhr et al., 2019) and image-text retrieval (Young et al., 2014). However, suffering from the same problem, pretrained vision-language models do not achieve better performance than textual models on multi-modal NER (Sun et al., 2021). (* Yong Jiang and Kewei Tu are the corresponding authors. ‡: equal contributions. This work was done when Xinyu Wang was visiting the StatNLP Research Group at SUTD. Our code is publicly available at http://github.com/modelscope/adaseq/examples/MoRe.)
Recently, approaches based on text retrieval have shown their effectiveness on question answering (Liu et al., 2020; Xu et al., 2022; Wang et al., 2022a), machine translation (Gu et al., 2018; Zhang et al., 2018a; Xu et al., 2020), language modeling (Guu et al., 2020; Borgeaud et al., 2021), NER (Wang et al., 2022c; Zhang et al., 2022b) and entity linking (Zhang et al., 2022a; Huang et al., 2022). These approaches use the input text as the search query to retrieve related knowledge from a knowledge corpus (KC), a key-value structured memory built from the knowledge source. Moreover, in practice humans can recognize entities (such as famous persons and locations) in an image based on their learned knowledge. When they are unsure about the entities in an image, they can even use image-based retrieval in a search engine to obtain knowledge related to the image. Inspired by this, we believe that retrieving the knowledge related to an image can likewise help multi-modal NER and RE models disambiguate named entities. In this paper, we propose the Multi-modal Retrieval based framework (MoRe), which explores the knowledge behind the input image-text pairs for multi-modal NER and RE. MoRe retrieves related knowledge for the input text and image using a textual retriever and an image retriever respectively. The text retriever retrieves the most related paragraphs in the KC, and the image retriever finds the documents containing the most related images. The retrieval results of each modality are sent to the textual and visual models respectively and used for training on the NER and RE tasks. After both models are trained, a Mixture of Experts (MoE) module is trained to learn how to combine the predictions of the two models.
The contributions of MoRe can be summarized in four aspects: 1. We propose a simple and effective way to inject knowledge-aware information into multi-modal NER and RE using multi-modal retrieval, which has rarely been explored for these tasks in previous work.
2. We empirically show that the knowledge from our text retrieval and image-based retrieval modules can significantly improve the performance of multi-modal NER and RE tasks.
3. We further propose MoE for multi-modal NER and RE, which can effectively combine the knowledge from the image and text retrieval modules. We show MoE can further improve performance and achieve state-of-the-art accuracy.
4. We conduct detailed analyses comparing the advantages of the text retrieval module and the image-based retrieval module. We show that the MoE module can correctly take advantage of the knowledge from each modality.

MoRe
Given an input text and image pair (x, I), where x = {x_1, ..., x_n}, MoRe aims to predict the output y, which is a label sequence for the NER task or a relation for the RE task. MoRe feeds (x, I) into the multi-modal retrieval module, which returns retrieved texts z_T and z_I from a text retrieval module and an image-based retrieval module respectively. The textual task model and visual task model concatenate the input sentence x with the retrieved knowledge z_T and z_I from the KC and predict the output distributions P(y|x, I, z_T) and P(y|x, I, z_I) respectively. Finally, the MoE module fuses the predictions of the two single-modality models and makes the final prediction P(y|x, I). The architecture of our framework is shown in Figure 1.

Multi-modal Retrieval Module
Text retrieval has been shown to be effective for NER, as the retrieval results can provide information for the disambiguation of complex entities. To retrieve knowledge related to the text and image, we design a multi-modal retrieval module. We choose Wikipedia, the largest online encyclopedia, as our knowledge source because it has a rich collection of articles describing entities, and these articles should provide useful clues for entity-related tasks. Considering the difficulty of pairing text and images in Wikipedia, we construct two separate retrieval systems in the module, one for text and one for images.
A retrieval system has two components: a KC and a knowledge retriever. The KC is denoted by {(k, v)}. The knowledge retriever calculates the relevance between an input query q and the keys in the KC, and then returns the values corresponding to the top-k keys most relevant to the query. We denote the retrieved result by {t_1, ..., t_k}. A summary of the two retrieval systems is shown in Table 1.

Textual Retrieval System
We retrieve the knowledge related to the input text x with the textual retrieval system. We build the KC from the articles in Wikipedia: each key is a sentence in Wikipedia and the corresponding value is the paragraph in which the sentence appears. Considering retrieval efficiency over a KC with 200 million entries, we choose a term-based text retriever. The retriever uses BM25 (Robertson et al., 1995) to calculate the relevance score between a key k and the query x.
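As a rough illustration of the term-based scoring used by such a retriever, the following is a minimal, self-contained BM25 sketch (the function name and the pre-tokenized inputs are our own simplifications; the actual system runs BM25 inside ElasticSearch over 200 million entries):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against the query with BM25.

    docs plays the role of the KC keys (Wikipedia sentences); in the real
    system the values returned are the enclosing paragraphs.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    # document frequency of each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

Ranking the keys by these scores and returning the values of the top-k keys yields the retrieved result {t_1, ..., t_k} described above.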
Image-based Retrieval System
According to the style manual of Wikipedia, the introduction section of an article summarizes its most important contents, and an image in an article is an important illustrative aid to understanding. Given an input image I, we search for related images in Wikipedia to collect knowledge about related entities. Each key in the KC is an image from a Wikipedia article and the corresponding value is the concatenation of the article title and the introduction section of that article. To find related images, we use an image encoder to encode the images into feature vectors. For each query I, the retriever returns its k-nearest neighbors under the inner-product metric.
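The k-nearest-neighbor search over image features can be sketched in plain NumPy (in the actual system the feature vectors come from a CLIP image encoder and the search is performed with Faiss; the function below is an illustrative stand-in):

```python
import numpy as np

def knn_inner_product(query_vec, key_vecs, k=10):
    """Return indices and scores of the k keys with the largest inner
    product to the query. With L2-normalized features, the inner product
    coincides with cosine similarity."""
    sims = key_vecs @ query_vec        # shape: (num_keys,)
    topk = np.argsort(-sims)[:k]       # indices of the k largest scores
    return topk, sims[topk]
```

In the retrieval system, the returned indices would be mapped to the corresponding KC values, i.e. the titles and introduction sections of the Wikipedia articles containing the matched images.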
Context Processing The top-k results from the KC are concatenated into z = {[X], t_1, ..., t_k}, where [X] is a special token marking the following sequence as retrieved text. Given a transformer-based embedding model, we chunk the retrieved knowledge z so that the total number of subtokens in x and z does not exceed the embedding model's subtoken limit. Since the retrieved texts are usually very long, it is hard to combine the retrieval results of the two modalities into a single z. As a result, we feed the retrieved knowledge z_T and z_I to separate models so that each model can fully utilize its own kind of information. (The style manual of Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style)
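A minimal sketch of this concatenate-then-chunk step, counting whole tokens rather than subtokens for simplicity (the function and special-mark names are our own illustrative choices):

```python
def build_input(x_tokens, retrieved, limit=512, mark="[X]"):
    """Concatenate the input tokens with the retrieved texts, truncating
    the retrieved part so the total length stays within the encoder's
    limit. retrieved is a list of token lists t_1, ..., t_k.

    Note: if x_tokens alone fills the budget, the retrieved part
    (including the [X] mark) is dropped entirely."""
    z = [mark]
    for t in retrieved:
        z.extend(t)
    budget = limit - len(x_tokens)
    return x_tokens + z[:max(budget, 0)]
```

A real implementation would count subtokens with the encoder's tokenizer instead of whole tokens, but the truncation logic is the same.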

Task Model
Given the retrieved knowledge z (which is either z_T or z_I), the task model predicts the probability distribution P(y|x, I, z).
Named Entity Recognition We take the NER task as a sequence labeling problem, which predicts a label sequence y = {y_1, ..., y_n} with one label at each position. The NER model feeds the concatenated text [x; z] into the transformer-based encoder and gets the token representations {r_1, ..., r_n} corresponding to x:

{r_1, ..., r_n, r_{n+1}, ..., r_{n+m}} = embed([x; z])

where m is the length of z. Through the attention module in the transformer-based encoder, the token representations {r_1, ..., r_n} incorporate the retrieved knowledge z. MoRe then feeds the representations into a linear-chain CRF layer to predict the probability distribution P(y|x, I, z) of the label sequence:

P(y|x, I, z) = Π_i ψ(y_{i-1}, y_i, r_i) / Σ_{y' ∈ Y(x)} Π_i ψ(y'_{i-1}, y'_i, r_i)

where ψ is the potential function, ψ(y', y, r) = exp((W^T r)_y + b_{y', y}). In the potential function, W ∈ R^{d×t} and b ∈ R^{t×t} are the parameters for calculating the emission and transition scores respectively, d is the hidden size of r, and t is the size of the label set. Y(x) denotes the set of all possible label sequences given x.
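The linear-chain CRF computation can be illustrated with a small NumPy sketch that evaluates log P(y|x) from given emission and transition scores, using the forward algorithm for the partition function (a simplified stand-in: in the model, the emission scores come from W^T r_i on the encoder states):

```python
import numpy as np

def crf_log_prob(y, emissions, transitions):
    """Log P(y|x) under a linear-chain CRF.

    emissions:   (n, t) array, emissions[i, l] = score of label l at i
    transitions: (t, t) array, transitions[a, b] = score of a -> b
    y:           list of n gold label indices
    """
    n, t = emissions.shape
    # unnormalized score of the given label sequence
    score = emissions[0, y[0]]
    for i in range(1, n):
        score += transitions[y[i - 1], y[i]] + emissions[i, y[i]]
    # log partition function via the forward algorithm
    alpha = emissions[0]                              # (t,)
    for i in range(1, n):
        # alpha[k] = logsumexp_j(alpha[j] + transitions[j, k]) + emission
        alpha = emissions[i] + np.logaddexp.reduce(
            alpha[:, None] + transitions, axis=0)
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z
```

Exponentiating and summing this quantity over all t^n label sequences gives exactly 1, which is a convenient sanity check for the forward recursion.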

Relation Extraction
In RE, the model aims to predict the distribution over relation labels P(y|x, I, z) given the subject entity {x_{start_s}, ..., x_{end_s}} and the object entity {x_{start_o}, ..., x_{end_o}}. We follow PURE (Zhong and Chen, 2021), which adds special markers to the input x to indicate the named entities:

x' = {x_1, ..., <S>, x_{start_s}, ..., x_{end_s}, </S>, ..., <O>, x_{start_o}, ..., x_{end_o}, </O>, ..., x_n}

In the equation, <S> and </S> indicate the start and end of the subject entity while <O> and </O> indicate the start and end of the object entity. Similar to the NER model, the RE model feeds the concatenated text [x'; z] into the transformer-based encoder and gets the token representations of <S> and <O>, denoted by r_s and r_o respectively. The probability distribution P(y|x, I, z) for relation extraction is then given by:

P(y|x, I, z) = softmax(M^T [r_s; r_o] + b')_y

where M ∈ R^{2d×t'} and b' ∈ R^{t'×1} are the parameters for RE, and t' is the size of the relation label set. Y' denotes the relation label set.
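The marker-insertion step borrowed from PURE can be sketched as follows (assuming non-overlapping subject and object spans with inclusive boundaries; the function name is our own):

```python
def add_entity_markers(tokens, subj, obj):
    """Insert <S>...</S> around the subject span and <O>...</O> around
    the object span. subj and obj are (start, end) index pairs,
    inclusive, and are assumed not to overlap."""
    opens = {subj[0]: "<S>", obj[0]: "<O>"}
    closes = {subj[1]: "</S>", obj[1]: "</O>"}
    out = []
    for i, tok in enumerate(tokens):
        if i in opens:
            out.append(opens[i])
        out.append(tok)
        if i in closes:
            out.append(closes[i])
    return out
```

The resulting x' is then concatenated with the retrieved knowledge z before being fed to the encoder.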

Mixture of Experts
As we mentioned in Section 2.1, we use a separate retrieval module for each modality, so the retrieval scores of the texts retrieved from the two modalities are not comparable. As a result, it is hard to determine a priori which retrieved knowledge is more helpful to model performance. We use MoE to alleviate this problem. The MoE module fuses the probability distributions from the textual model and the visual model to obtain better predictions. To obtain the overall probability of generating y, we treat the modality e as a latent variable and calculate the marginal distribution over e:

P(y|x, I) = Σ_e P_θe(y|x, I, z_e) P_θc(e|x, I)

where θ_e denotes the task model parameters trained with the retrieved knowledge z_e and θ_c denotes the model parameters of MoE. To calculate P_θc(e|x, I), we use a text encoder and an image encoder to extract the representations r_T of x and r_I of I respectively, and compute:

P_θc(e|x, I) = softmax(U^T [r_T; r_I] + b*)_e

where U ∈ R^{(d_T+d_I)×t} and b* ∈ R^{t×1}; d_T and d_I are the dimensions of r_T and r_I respectively. The final prediction of the MoE module is given by:

ŷ = argmax_{y ∈ Ŷ} Σ_e P_θe(y|x, I, z_e) P_θc(e|x, I)

where Ŷ is Y(x) for NER or Y' for RE. In RE, ŷ can easily be computed by finding the relation label y with the largest probability. However, for NER with a linear-chain CRF layer, y is a label sequence for the input x, and the set of possible label sequences Y(x) is exponential in size. As a result, it is difficult to calculate the weighted sum of the two probability distributions (i.e., P_θT(y|x, I, z_T) and P_θI(y|x, I, z_I)) over an exponential number of label sequences. Instead of calculating the equation directly, we approximate this process by assuming that the label at each position is determined independently:

ŷ_i = argmax_{y_i ∈ Y*} Σ_e P_θe(y_i|x, I) P_θc(e|x, I)

where Y* is the NER label set. We use the forward-backward algorithm to calculate the marginal probability distribution P_θe(y_i|x, I) over the CRF potentials.
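The independent per-position approximation amounts to a weighted sum of each expert's per-position marginals under the gate distribution P(e|x, I). A minimal NumPy illustration (the function name is our own; the marginals would come from the forward-backward algorithm on each expert's CRF):

```python
import numpy as np

def moe_position_marginals(gate, marg_text, marg_image):
    """Mix per-position label marginals from the text and image experts.

    gate:       (2,) gate probabilities P(e|x, I) for e in {text, image}
    marg_text:  (n, t) array, row i is P(y_i | x, I, z_T)
    marg_image: (n, t) array, row i is P(y_i | x, I, z_I)
    Returns the per-position argmax labels and the mixed marginals."""
    mixed = gate[0] * marg_text + gate[1] * marg_image
    return mixed.argmax(axis=-1), mixed
```

For RE, the same mixing is applied once to the two relation-label distributions rather than position by position.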

Training
Named Entity Recognition We use the negative log-likelihood (NLL) as the training loss for an input sequence with gold labels y*:

L_NER = -log P(y*|x, I, z)

Relation Extraction Similar to NER, we calculate the NLL loss with the gold label y*:

L_RE = -log P(y*|x, I, z)

Mixture of Experts Given the trained task models with parameters θ_T and θ_I, the MoE module is trained with the NLL loss on P_θc(y|x, I). The parameters θ_T and θ_I of the trained task models are kept fixed during the training of MoE.

Settings
Retrieval System Configuration For the retrieval systems, we build the KCs using the English Wikipedia dumps. We convert the dumps into plain text and download the images appearing in the articles. To take advantage of the rich anchors in Wikipedia, we mark them with a special tag. For example, the anchor of "Alan Turing" is tagged, and the text "Alan Turing published an article ..." is transformed into "<e: Alan_Turing> Alan Turing </e> published an article ...". There are about 200 million entries in the KC for textual retrieval and 4 million entries in the one for image-based retrieval. We build the term-based textual retriever with the search engine ElasticSearch. We use the ViT-B/32 model in CLIP to encode images in the feature-based image retriever and use Faiss (Johnson et al., 2019) for efficient search. For both retrieval modules, we use the top-10 retrieval candidates.

Training Configuration During training, we finetune the models with the AdamW (Loshchilov and Hutter, 2018) optimizer. In all experiments, we use grid search to find the learning rate for the embeddings within [1 × 10^-6, 5 × 10^-5]. We use a learning rate of 5 × 10^-6 and a batch size of 4 for task model training. Following ITA, we use the cross-view alignment loss to minimize the KL divergence between the output distributions of the retrieval-based input and the original input. For MoE, we use the same learning rate and a batch size of 64 instead. The task models are trained for 10 epochs and the MoE models are trained for 50 epochs. All results are averaged over 3 runs with different random seeds.
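The anchor-tagging transformation described above can be sketched as follows (the function name and the (surface, title) input format are our own illustrative choices, not the paper's actual preprocessing code):

```python
def tag_anchors(text, anchors):
    """Wrap each anchor's surface form with <e: Title> ... </e> tags.

    anchors: list of (surface_form, wiki_title) pairs; each surface
    form is assumed to occur in text. Only the first occurrence of
    each surface form is tagged in this simplified sketch."""
    for surface, title in anchors:
        text = text.replace(surface, f"<e: {title}> {surface} </e>", 1)
    return text
```

For example, tagging the anchor ("Alan Turing", "Alan_Turing") reproduces the transformation given in the text.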

Results
We compare MoRe with our baseline and previous state-of-the-art approaches on multi-modal NER and RE. Our baseline is the model without the retrieval and MoE modules. We also report the performance with only the text retrieval module and with only the image-based retrieval module to show the strength of each retrieval module. The results in Table 2 show that MoRe outperforms all of the previous state-of-the-art approaches. With only the text retrieval module or only the image-based retrieval module, our model is competitive and even outperforms ITA. On the WikiDiverse dataset, our models improve over ITA by a larger margin. A possible reason is that our approach can retrieve more helpful information from the KC in the news domain, while caption and object extractors do not perform well in this domain; this suggests the performance of ITA may be limited in certain task domains. Comparing our models with text retrieval and with image-based retrieval against our baseline, our retrieval approaches are significantly stronger than the baseline (Student's t-test, p < 0.05).
In most cases, models with the image-based retrieval module perform better than models with the text retrieval module, except on the WikiDiverse dataset. A possible reason is that the knowledge from text retrieval is more critical in the news domain.

Comparison with Other Variants of MoE
To further show the advantage of our MoE module over the text-only and image-only models, we compare several variants in Table 3. In this analysis, we mix two textual models and two image models trained with different random seeds. In Table 3, "Text" and "Image" denote a mixture of two text retrieval based or two image-based retrieval based models respectively, and "Text+Image" denotes a mixture of a text retrieval based model and an image-based retrieval based model. (The model with only the text retrieval module is similar to a previous retrieval-based NER model; however, that textual retrieval is based on Google while ours is based on Wikipedia, which makes our local retrieval module much faster and more practical. Note also that none of the previous approaches use any retrieval techniques for these tasks.) First, we compare our MoE approach with average pooling, which averages the probability distributions of two models. The results show that our MoE approach significantly outperforms all the average pooling approaches (Student's t-test, p < 0.05) except on the SNAP dataset, which shows the effectiveness of MoE. Comparing the three average pooling approaches, averaging the probability distributions of the text and image models performs better than averaging the distributions of two models from the same modality. Our MoE is also significantly stronger than the single-modality MoE approaches on all datasets (p < 0.05). Moreover, we find that the relative improvement of MoE depends on the specific dataset. For example, the improvements of MoE over average pooling are relatively small on T-15 and SNAP, while the improvements on the T-17 and Wiki datasets are relatively large. A possible reason is that the importance of text retrieval and image retrieval is almost equal for most of the samples in the T-15 and SNAP datasets. Similarly, the advantage of multi-modal MoE over single-modal MoE depends on the dataset as well.
When the knowledge retrieved from images and that retrieved from texts are more complementary, the relative improvement is much higher (e.g., on Wiki).

How Text Retrieval and Image-based Retrieval Affect Model Prediction
To further show the advantages of the text retrieval and image-based retrieval modules in MoRe, we compare label-wise F1 scores in Table 4. We choose location, organization, others and person as the representative labels, which are the most common labels in the dataset and are consistent with the entity label sets of the SNAP and MNRE datasets. For MNRE, we calculate an entity-label-based F1 score, i.e., the relation F1 score for each entity type. For example, if the relation of two entities is predicted as "/per/org/member_of" (meaning the subject is of person type, the object is of organization type and the relationship between them is "member_of"), the relation is counted into the relation F1 scores of both "per" and "org" entities. We calculate the relation F1 score in this way to analyze how the retrieval system affects each entity label in RE. From the results in Table 4, we observe that: 1) the models with a retrieval module outperform our baselines on all labels; 2) the image-based retrieval module in MoRe is very helpful for recognizing person and organization entities; 3) the text retrieval module in MoRe is helpful for recognizing entities of the others type; 4) for location entities, the text retrieval module has the advantage in NER while the image-based retrieval module has the advantage in RE. A possible reason is that image-based retrieval can easily capture person and organization entities, since persons and organizations usually appear in the image. However, entities of the others type, such as the names of creative works and festivals, are rarely depicted in images, while the related knowledge of such entities can easily be found through text retrieval.

How the Knowledge Quality Affects Performance
We analyze how the task model performs when the quality of the retrieved knowledge drops. We randomly select retrieved knowledge from the KCs of the text retrieval module and the image-based retrieval module respectively and train the models on this random knowledge. The results in Table 5 show that in both conditions, the model performance drops moderately compared with our baseline, which is trained without retrieval results. This observation shows that randomly retrieved knowledge introduces noise into the model, and that the improvements of our task models come from the related knowledge provided by our designed retrieval module rather than from the extended input sequence length.

How Approximation Affects Performance
In Section 2.3, we propose to calculate the marginal probability distribution of the CRF layer for NER to approximately compute the MoE target function. To show how the approximation may affect model performance, we compute the prediction of our NER model by calculating argmax_{y_i ∈ Y*} P_θe(y_i|x, I) at each position. The results of our task models with the text retrieval and image-based retrieval modules are shown in Table 6. The results show that the approximation reduces the model performance by no more than 0.1 F1 score. Therefore we can use the approximated probability distribution to compute the MoE target function, which is much easier than the original function for the linear-chain CRF layer.

Speed Comparison
We compare model speeds with a batch size of 1. To further show the advantage of MoRe, we also measure the speed of the feature extraction parts (i.e., the object and caption extractors based on VinVL (Zhang et al., 2021c)) of ITA. We observe that the bottleneck of MoRe is the CLIP feature extraction part, but it is much faster than the feature extraction module of ITA. This observation shows the speed advantage of MoRe over ITA.

Case Study
In the case study, we show the importance of knowledge from text retrieval and image-based retrieval.
In Figure 2 (a), the input text talks about the festival at "Cannes" while the input image shows the beaches there. The text retrieval results mainly talk about the Cannes Film Festival while the image-based retrieval results mainly talk about similar beaches around the world. For the entity "Cannes", the text retrieval results are much more helpful for disambiguating the entity since they mention the location "Cannes" multiple times. In Figure 2 (b), it is hard to recognize that the named entity "Kolo" is a dog's name given only the input text, and text retrieval also fails to find information related to the sentence. However, image-based retrieval returns knowledge about dogs similar to the one in the input image, which helps the model recognize that "Kolo" is probably a dog's name rather than a person's name.

Related Work
Introducing Visual Information to Improve Multi-modal NLP Tasks In the natural language processing community, improving NLP tasks by introducing visual information has become a focus of recent studies. In many scenarios, there is plenty of visual information available for NLP tasks such as NER (Zhang et al., 2018b; Moon et al., 2018; Lu et al., 2018), RE (Zheng et al., 2021a,b), keyphrase prediction, and entity linking (Gan et al., 2021; Zhang et al., 2021b; Wang et al., 2022d). Most approaches introduce a special attention mechanism to model the interaction between the representations of objects in the image and the input text (Zhang et al., 2018b; Sun et al., 2021; Zheng et al., 2021a). Some approaches additionally introduce OCR text and image captions for further improvements. Recently, ITA suggests that because the representations of images and texts are trained separately, the representations are not aligned, and it is hard for a newly introduced attention mechanism to model their interaction. ITA proposes to convert an image into text, namely object tags, image captions, and OCR text, to ease the alignment problem between text and image. However, this approach may be limited by the training domain of the image information extractor. In comparison, we explore the knowledge related to an image instead of the surface information of an image. The KC can be much easier to build for specific domains, since building it only requires a large scale of domain-specific unlabeled data. Pretrained vision-language models such as LXMERT (Tan and Bansal, 2019a), UNITER, Oscar, E2E-VLP (Xu et al., 2021) and mPLUG (Li et al., 2022) are trained on image-text pairs and achieve significant improvements on tasks like captioning, VQA and image-text retrieval. Their pretraining aims to align the image and text features into the same space so that the performance of multi-modal tasks can be improved.
However, the text representations in pretrained vision-language models are usually not as strong as those of pretrained language models. As a result, some recent works (Sun et al., 2021) find that pretrained vision-language models do not perform well on multi-modal NLP tasks such as NER.
Retrieval-based NLP For knowledge-intensive NLP tasks, retrieval is an effective method for utilizing external knowledge. Knowledge retrieval has been applied to many NLP tasks such as question answering (Liu et al., 2020; Xu et al., 2022; Izacard and Grave, 2021), machine translation (Gu et al., 2018; Zhang et al., 2018a; Xu et al., 2020), NER (Wang et al., 2022c; Zhang et al., 2022b) and entity linking (Zhang et al., 2022a; Huang et al., 2022). Compared with these works, our work novelly introduces an image-based retrieval module, which retrieves the knowledge behind the image to improve multi-modal NER and RE. Recently, some works have introduced knowledge retrieval into language model pretraining. REALM (Guu et al., 2020) trains a latent knowledge retriever and a knowledge-augmented encoder in an end-to-end manner during pretraining and finetuning. The generative process in REALM is decomposed into retrieving and predicting: the retrieved knowledge is treated as a latent variable and marginalized out. Inspired by this generative process, our MoE module treats whether the retrieved knowledge comes from text or image as the latent variable. While REALM aggregates the top-k knowledge retrieved from text with the latent variable, we use it to aggregate the knowledge retrieved from different modalities. Since its retriever is trainable, REALM needs to asynchronously re-embed and re-index all documents during training. To scale to larger database sizes, RETRO (Borgeaud et al., 2021) freezes the retriever and applies a chunked cross-attention mechanism to make use of databases of trillions of tokens. For efficiency considerations, we also freeze the retriever module in MoRe.

[Figure 2: Two case studies of how the text retrieval and image-based retrieval help model predictions. Gold: S-MISC; Baseline: S-PER; MoRe_Text: S-PER; MoRe_Image: S-MISC.]

Conclusion
In this paper, we introduce a novel Multi-modal Retrieval based framework (MoRe) that utilizes the knowledge behind multi-modal inputs. MoRe first retrieves knowledge related to the input text and image from a text retrieval module and an image-based retrieval module. MoRe then feeds the retrieved knowledge from the two modules into the textual and visual task models respectively to make predictions. Given the predictions from the task model of each modality, MoRe combines the predictions with a Mixture of Experts (MoE) module, which takes the features of the input text and image into consideration and makes the final decision. In our experiments, we show that both our textual model and our visual model achieve state-of-the-art performance on four multi-modal NER datasets and one multi-modal RE dataset, and that MoE further improves model performance. In our analysis, we demonstrate the advantage of integrating both textual and visual cues for such tasks across different types of labels.

Limitations
MoRe requires a textual and a visual KC for the task; we build our KCs based on Wikipedia. However, some scenarios require building the KC from domain-specific unlabeled data, and in these cases the unlabeled data, especially data with images, must be collected with some effort. Moreover, the input to MoRe is significantly longer than the original input text, since the new input contains the retrieved knowledge. As a result, inference is significantly slower than with the original input texts, so MoRe may not satisfy some time-critical scenarios. However, techniques such as knowledge distillation (Hinton et al., 2015) can be used to distill the knowledge from MoRe into smaller models for faster inference.

Ethics Statement
In this paper, we use publicly available datasets for our experiments. We build the KCs based on Wikipedia, which is one of the largest online encyclopedias and is publicly available. Therefore, we believe we do not use any personal data that invades users' privacy.