ESPVR: Entity Spans Position Visual Regions for Multimodal Named Entity Recognition



Introduction
Named entity recognition (NER) is a fundamental task in the field of information extraction, which automatically recognizes named entities in text and classifies them into predefined categories. NER has been widely used for many downstream tasks, such as entity linking and relation extraction. With the rapid development of social media, multimodal deep learning is widely used to perform structured extraction from massive multimedia news and web product information. Among these tasks, Multimodal Named Entity Recognition (MNER) aims to identify and classify named entities in text using images as auxiliary information. MNER can disambiguate polysemous words by augmenting linguistic representations with visual information, resulting in superior performance compared to traditional NER.
While previous efforts have yielded promising results, they still fall short in effectively selecting visual information. We classify existing ways of utilizing visual information into two types: global visual information and local visual information.
Some previous works (Lu et al., 2018; Zhang et al., 2018; Yu et al., 2020; Chen et al., 2021; Sun et al., 2020, 2021; Liu et al., 2022a,b; Wang et al., 2022) consider that if the whole image is input to the multimodal interaction module, then such image information is global visual information. However, the multimodal interaction module relies on attention to select the visual regions associated with the text for interaction. Therefore, the visual information that eventually interacts with the text is mainly the local visual information related to the text. In other words, even if the input to the multimodal interaction module is the whole image, the local visual information is ultimately selected by an attention-based method and then interacts with the text, so we regard this as a process of using attention to select local visual information. When attention is used to extract visual regions, it is distracted by the entire image rather than being fully focused on the visual regions most relevant to the text. Therefore, selecting local visual information with an attention-based method not only obtains valuable visual information but also introduces irrelevant visual information.
Besides, most of the previous approaches (Wu et al., 2020; Wang et al., 2020; Zheng et al., 2020; Zhang et al., 2021; Wang et al., 2021) use object detection (e.g., Mask R-CNN) to detect visual object regions and treat the detected objects as local visual information to interact with the text. However, object detection covers a limited range of recognition categories, so it may not detect all objects within the categories defined by the dataset. Moreover, the visual regions obtained by object detection may not correspond to the entities in the text.
In summary, the goal of these methods is not to extract the visual regions most relevant to the entities in the text. The visual regions obtained by the above two methods may be redundant or insufficient for the entities contained in the text, leading to identifying a non-entity as an entity or incorrectly predicting an entity's category. Therefore, to obtain the visual regions most relevant to the entities in the text, we propose an Entity Spans Position Visual Regions (ESPVR) module. Specifically, the ESPVR module consists of two submodules: an Entity Spans Identifying (ESI) module and a Visual Regions Positioning (VRP) module. First, the ESI module identifies all entity spans in the text. Then the VRP module uses these entity spans to extract entity features and uses the entity features to locate the visual regions most relevant to the entities in the text.
To summarize, the major contributions of our paper are as follows: • We propose a novel ESPVR module for MNER, which can select the visual regions most relevant to the entities in the text. The ESPVR module consists of two submodules: an Entity Spans Identifying (ESI) module and a Visual Regions Positioning (VRP) module.
• We conduct extensive experiments on two benchmark datasets, Twitter-2015 and Twitter-2017, to evaluate the performance of our ESPVR module. Experimental results show that the ESPVR module outperforms the current state-of-the-art models on Twitter-2017 and yields competitive results on Twitter-2015.

Related work
In general, studies on MNER are similar in terms of text feature extraction, but differ in how they use image information and fuse modal information. Existing work can be classified into the following two categories based on the use of visual information: (1) The entire image is equally segmented into multiple visual regions by a convolutional architecture (e.g., ResNet), and a multimodal interaction module with attention then selects the visual regions associated with the text.
In fact, not all visual regions within an image are beneficial for improving the accuracy of model predictions. To address this problem, some researchers proposed first dividing the whole image into multiple equal visual regions and then extracting the regions most relevant to the text in order to filter out irrelevant visual information. The visual information that ends up interacting with the text is thus the local visual information after filtering, even if the input is global visual information. Lu et al. (2018) used a pre-trained ResNet model to extract visual regions and then added them to the text embedding via a visual attention model. To make full use of text and visual information, Zhang et al. (2018) used adaptive co-attention networks to fuse text embeddings and visual region representations. Yu et al. (2020) proposed a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations, and used text-only entity span detection as an auxiliary module to mitigate visual bias. Chen et al. (2021) used an external knowledge database to obtain the final multimodal representation through an attention-guided visual layer.
Irrelevant text-image pairs account for a large proportion of the dataset. Therefore, Sun et al. (2021) and Sun et al. (2020) used a modified BERT encoder to obtain information for inter-modal fusion and then introduced text-image relationship classification as a subtask to determine whether image features were useful. In addition, Liu et al. (2022a) proposed a novel uncertainty-aware framework for social media MNER.
To address the problem of fine-grained semantic correspondence between objects in images and words in the text, Liu et al. (2022b) enhanced the representation of each word in the text through semantic enhancement and performed cross-modal semantic interaction between text and vision at different visual granularities. Wang et al. (2022) proposed a Scene graph driven Multimodal Multi-granularity Multitask learning framework.
(2) Obtaining visual objects from the whole image by object detection (e.g., Mask R-CNN), and treating visual objects as local visual information to interact with the text.
Visual objects are considered fine-grained image information. Although many multimodal neural techniques have been proposed to incorporate images into the MNER task, the ability of models to exploit multimodal interactions remains unclear. Therefore, Chen et al. (2020) provided an in-depth analysis of existing multimodal fusion techniques from different perspectives and described situations where the use of image information did not improve performance. Building on this work, Wang et al. (2021) aligned image features into the text space by image-text alignment to better utilize the attention mechanism in transformer-based pre-trained textual embeddings.

Method
In this section, we first introduce the task definition, and then describe the proposed model in detail.
Task Formulation: Given a text-image pair (X, V) as input, where X = {x_1, ..., x_n} and n denotes the length of the text, the goal of MNER is to extract a set of entities from X and classify each entity into one of the pre-defined categories with the assistance of image information, e.g., Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). As with most existing work in MNER, we regard the task as a sequence labeling problem. Specifically, let Y = {y_1, y_2, ..., y_n} represent the label sequence corresponding to X, where y_i ∈ ζ and ζ is the pre-defined label set under the standard BIO2 tagging scheme. We first obtain word representations and visual representations, respectively. Then, to obtain local visual information, we deploy the Entity Spans Position Visual Regions (ESPVR) module to position the visual regions most relevant to the entities in the text. Next, a Multimodal Interaction module is devised to fully capture the cross-modal semantic interaction between the textual hidden representations and the visual region hidden representations. Finally, the CRF Decoding module assigns an entity label to each word in the input sequence, leveraging the hidden representations obtained from the Multimodal Interaction module.
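As a concrete illustration of the sequence labeling formulation, the following sketch (with an invented example sentence, not taken from the datasets) shows BIO2 labels and how entity spans can be recovered from them:

```python
# Hypothetical example of BIO2-tagged tokens (not from the Twitter datasets).
tokens = ["Kevin", "Durant", "joins", "Golden", "State", "Warriors"]
labels = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG"]

def extract_entities(tokens, labels):
    """Group consecutive B-/I- labels of the same type into (type, span) pairs."""
    entities, current = [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(current)
            current = (lab[2:], [tok])          # start a new entity span
        elif lab.startswith("I-") and current and current[0] == lab[2:]:
            current[1].append(tok)              # continue the current span
        else:
            if current:
                entities.append(current)
            current = None                      # O label or inconsistent I- tag
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]
```

Running `extract_entities(tokens, labels)` on the example yields one PER and one ORG span, matching the categories listed above.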

Feature Extraction Module
Word Representations: To better model the semantic information of the text X and obtain different representations for the same word in different contexts, we leverage the pre-trained language model BERT (Devlin et al., 2018) as our text encoder. Moreover, image captions from an image captioning model can describe the whole image and provide additional semantic information. Therefore, to let the text learn contextual information from an image caption, we first use [SEP] to concatenate the text and the image caption as a cross-modal input view rather than a text-only input view. Then, we denote the input as X′ = {[CLS], x_1, ..., x_n, [SEP], x′_1, ..., x′_k, [SEP]},

Entity Spans Position Visual Regions (ESPVR) Module
As shown on the left side of the ESPVR module in Fig. 1, we first use a transformer layer with self-attention (Vaswani et al., 2017) to capture the intra-modality relations for the text modality and obtain each word's textual hidden representation T = {t_0, t_1, ..., t_{n+1}}, where t_i ∈ R^d denotes the generated hidden representation for x_i.
Entity Spans Identifying (ESI) module: The purpose of the ESI module is to identify the positions of the heads and tails of the entities in the text, which are used for positioning the visual regions most relevant to those entities.
We remove the type information and define the set of span labels Z′ = {B, I, O}, and use Z = {z_1, ..., z_n} to denote the sequence of labels, where z_i ∈ Z′. Subsequently, T is fed into the Span_CRF decoding layer to predict the label sequence Z for X.
Visual Regions Positioning (VRP) Module: After obtaining the label sequence Z for X, we first select the entity features E = {e_1, e_2, ..., e_m} corresponding to entities from T based on the labels in Z, where e_i stands for the feature of the i-th entity and m stands for the number of entities in the text. Then, to maintain the same scale as most of the original images, we extract a visual region feature v of size α × α that is most relevant to the entities in the text. The specific process is as follows: First, we select the visual region feature b_c from {b_1, ..., b_144} that is most relevant to the entities in the text. Specifically, we take each entity feature in E = {e_1, e_2, ..., e_m} as Q and each visual region feature in {b_1, ..., b_144} as K, calculate the correlation score F(e_i, b_j) = QK^T / √d_k between each visual region feature and each entity feature, and sum the correlation scores of each visual region feature over all entity features to obtain 144 correlation scores. We select the visual region feature b_c corresponding to the maximum of the 144 summed scores, c = argmax_j Σ_i F(e_i, b_j), where d_k is the dimension of the key vector K.
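The selection of b_c can be sketched as below. This is a simplified sketch: it uses a plain scaled dot-product between raw entity and region features, without the learned query/key projections a full implementation would likely include.

```python
import numpy as np

def select_most_relevant_region(E, B, d_k):
    """Score every visual region feature against all entity features with a
    scaled dot-product, sum the scores over entities, and return the index c
    of the highest-scoring region b_c (learned projections omitted).

    E: (m, d) entity features used as queries.
    B: (num_regions, d) visual region features used as keys.
    """
    scores = E @ B.T / np.sqrt(d_k)   # (m, num_regions): F(e_i, b_j)
    summed = scores.sum(axis=0)       # sum over all entity features
    return int(np.argmax(summed))     # index of the maximum summed score
```

With the paper's setup, B would hold the 144 Swin Transformer region features {b_1, ..., b_144}.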
Then, we select neighboring visual region features based on the index c of b_c. Specifically, the positions of all visual region features are expressed in row and column coordinates, and visual region features whose row and column distances from b_c are both less than α are identified as neighboring visual region features of b_c, where 1 ≤ α ≤ 11. Because the grid of visual regions is 12 × 12, the upper limit of α is 11.
Next, we select the neighboring visual region feature b_l that is most relevant to b_c from all neighboring visual region features. Specifically, we take the visual region feature b_c as Q and each neighboring visual region feature as K, use attention to calculate the correlation score between b_c and each neighboring visual region feature, and select the neighboring visual region feature b_l with the largest correlation score.
Finally, to maintain the same scale as most of the original images, we extract a visual region feature v of size α × α. Specifically, we compare the coordinate of the visual region feature b_c with the coordinate of the neighboring visual region feature b_l, and take the minimum row and column as the coordinate t of the top-left visual region feature b_t of v. Using α as the edge length of v, we add α to the row and column of coordinate t to obtain the coordinate d of the bottom-right visual region feature b_d of v. All the visual region features lying within the range of t and d form the overall visual region feature v.
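The window extraction step can be sketched as follows. Two assumptions are made that the text does not fully specify: indices are 0-based, and the window is clipped so it never extends past the 12 × 12 grid boundary.

```python
def extract_region_window(c, l, alpha, grid=12):
    """Return the (row_range, col_range) of the alpha x alpha window v.

    c: flat index of the most relevant region b_c on a grid x grid map.
    l: flat index of b_c's most relevant neighbour b_l.
    The top-left coordinate t is the element-wise minimum of the two
    coordinates; the bottom-right coordinate d is t shifted by alpha.
    Clipping to the grid boundary is an assumption, not from the paper.
    """
    rc, cc = divmod(c, grid)                 # coordinate of b_c
    rl, cl = divmod(l, grid)                 # coordinate of b_l
    top, left = min(rc, rl), min(cc, cl)     # top-left coordinate t
    top = min(top, grid - alpha)             # keep the window inside the grid
    left = min(left, grid - alpha)
    return (top, top + alpha), (left, left + alpha)   # ranges up to d
```

For example, with b_c at (5, 5), b_l at (5, 6), and α = 3, the window covers rows 5–7 and columns 5–7 of the region grid.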

Multimodal Interaction Module
Following (Yu et al., 2020), we stack cross-modality Transformer layers to learn the cross-modal interaction between the words and the visual regions. The components of the cross-modality Transformer (CMT) layer are the same as those of the Transformer.
To obtain image-aware word representations, we stack two CMT layers to perform superior-level semantic interaction. The two CMT layers are computed internally in the same way, except that their Q, K, and V come from different sources. In the first stage, we perform multi-head Cross-Modal Attention by treating v as Q, and T as K and V, where CA_i is the i-th head of Cross-Modal Attention, {W_qi, W_ki, W_vi} ∈ R^{d/m×d} are the weight matrices for Q, K, and V respectively, and W′ ∈ R^{d×d} is the weight matrix of multi-head attention. Then, we obtain the output P = {p_0, p_1, ..., p_{n+1}} of the first CMT layer, where LN and FFN stand for layer normalization and the feed-forward network respectively. In the second stage, we treat T as Q, and P as K and V. The second CMT layer then generates the image-aware word representations A = {a_0, a_1, ..., a_{n+1}}.
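The core of each CMT layer is cross-modal attention, where queries come from one modality and keys/values from the other. A minimal single-head sketch (the paper uses multiple heads, an output projection W′, plus LN and FFN around it, which are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(Q_in, K_in, V_in, Wq, Wk, Wv):
    """One head of cross-modal attention: project the query modality and the
    key/value modality, then attend with scaled dot-product attention."""
    Q, K, V = Q_in @ Wq, K_in @ Wk, V_in @ Wv
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (len_q, len_kv)
    return attn @ V                                  # (len_q, d)
```

In the first CMT stage of the paper, `Q_in` would be the visual region feature v while `K_in` and `V_in` would be the textual hidden representations T; the second stage swaps the roles.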
To obtain word-aware visual representations, we use one CMT layer to perform basic-level semantic interaction. We treat T as Q, and v as K and V. The word-aware visual representations Q = {q_0, q_1, ..., q_{n+1}} can then be computed through Equations 3-6.
To trade off the cross-modality contributions, we use a gate function to obtain the final semantic interaction representation H = {h_0, h_1, ..., h_{n+1}}, where A are the image-aware word representations, Q are the word-aware visual representations, {W_a, W_q} ∈ R^{d×d} refer to weight matrices, and σ stands for the element-wise sigmoid function.
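The exact gating equation is not reproduced in the text above, so the sketch below uses a common gating formulation consistent with the symbols listed (g = σ(AW_a + QW_q), then a convex combination of the two modalities); treat it as an assumption rather than the paper's exact rule:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(A, Qrep, Wa, Wq):
    """Fuse image-aware word representations A with word-aware visual
    representations Qrep using an element-wise sigmoid gate.

    Assumed form: g = sigma(A @ Wa + Qrep @ Wq); H = g * A + (1 - g) * Qrep.
    """
    g = sigmoid(A @ Wa + Qrep @ Wq)      # gate in (0, 1), per dimension
    return g * A + (1.0 - g) * Qrep      # convex combination of modalities
```

With zero weight matrices the gate is exactly 0.5 everywhere, so H becomes the average of the two representations, which is a convenient sanity check.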

CRF Decoding Module
Conditional Random Fields (CRF) take into account the correlations between labels in neighboring positions and assign a score to the entire sequence of labels, which can improve accuracy in sequence labeling tasks. Consequently, given a sequence X, the probability of a label sequence y is p(y|X) = exp(Σ_i S_i(y_{i-1}, y_i, X)) / Σ_{y′} exp(Σ_i S_i(y′_{i-1}, y′_i, X)), where S_i(y_{i-1}, y_i, X) and S_i(y′_{i-1}, y′_i, X) are potential functions.
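A minimal brute-force sketch of linear-chain CRF scoring and normalization (real implementations use the forward algorithm; here S_i is assumed to decompose into an emission term plus a label-transition term, which is the standard choice):

```python
import numpy as np
from itertools import product

def crf_sequence_score(emissions, transitions, tags):
    """Sum of potentials for one label sequence: emissions[i, y_i] plus
    transitions[y_{i-1}, y_i] at every step (assumed decomposition of S_i)."""
    score = emissions[0, tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

def crf_log_prob(emissions, transitions, tags):
    """log p(y|X): sequence score minus the log partition function, computed
    here by brute-force enumeration over all label sequences."""
    n, k = emissions.shape
    all_scores = [crf_sequence_score(emissions, transitions, list(seq))
                  for seq in product(range(k), repeat=n)]
    log_z = np.log(np.sum(np.exp(all_scores)))
    return crf_sequence_score(emissions, transitions, tags) - log_z
```

Because of the explicit normalization, the probabilities of all possible label sequences sum to one, mirroring the equation above.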

Model Training
There are two tasks in our proposed ESPVR model: MNER and ESI. In the training phase, we jointly train the whole model. The final training objective L is the combination of the MNER loss and the ESI loss. By minimizing the negative log-likelihood, L can be denoted as L = L_MNER + λ L_ESI, where λ is a hyperparameter controlling the contribution of the auxiliary ESI module. Here we set λ to 0.08.
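The joint objective is a simple weighted sum, which in code is just:

```python
def joint_loss(loss_mner, loss_esi, lam=0.08):
    """Combine the MNER negative log-likelihood with the auxiliary ESI loss,
    weighted by the trade-off hyperparameter lambda (0.08 in the paper)."""
    return loss_mner + lam * loss_esi
```

During training, both losses would be computed on the same batch and the combined scalar backpropagated through the shared encoder.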

Experiments
In the following section, we conduct experiments on two MNER datasets, comparing our Entity Spans Position Visual Regions (ESPVR) approach with several unimodal and multimodal approaches.

Experiment Settings
Datasets: We use two public MNER datasets, Twitter-2015 (Zhang et al., 2018) and Twitter-2017 (Lu et al., 2018), to evaluate the effectiveness of our framework. Twitter-2015 and Twitter-2017 include multimodal tweets from 2014 to 2015 and from 2016 to 2017, respectively. Both datasets contain four types of entities: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC) (in Twitter-2017 the last tag is Other; here we collectively refer to them as MISC). Each sample in the two datasets is composed of a (sentence, image) pair.
Implementation Details: For both datasets, we use the same hyperparameters. To compare each unimodal and multimodal method in the experiments, the maximum length of the text is set to 128, which can cover all words. For our ESPVR approach, the main hyperparameters are set as follows: The word representations C are initialized with the pre-trained BERT (bert-base-uncased) model of dimension 768 (Devlin et al., 2018) and fine-tuned during training. The visual embeddings B are initialized by Swin Transformer with a dimension of 1536; Swin Transformer is frozen during training. The self-attention layer uses 8 heads, and 4 layers are stacked. Additionally, the Cross-Modal Attention has a feature dimension of 512. The learning rate, the dropout rate, and the trade-off parameter λ are set to 1e-4, 0.4, and 0.08 respectively, which achieve the best performance on the development sets of both datasets via a small grid search over the ranges [1e-5, 1e-4], [0.1, 0.5], and [0.05, 0.9]. We implement the proposed model with PyTorch (Paszke et al., 2019). The model is trained and tested on one Nvidia GeForce RTX 2080 GPU with batch size 32.

Main Results
Following the other baselines, we employ standard precision (P), recall (R), and F1 score (F1) to evaluate the overall performance, and report F1 for each single entity type. Since the two Twitter datasets differ significantly in type distribution and data characteristics, we also conduct extensive experiments in self-domain and cross-domain settings to demonstrate the validity and generality of our approach.
Self-domain Scenario. Table 1 shows the overall results of unimodal and multimodal approaches on the two benchmark Twitter MNER datasets. From the table, we have the following findings: (1) The pre-trained model BERT is more powerful than conventional neural networks. This indicates that the pre-trained model can indeed provide abundant syntactic and semantic features. CRF considers the correlations between labels in neighboring positions and scores the whole sequence of labels. Therefore, recent approaches are typically based on BERT-CRF.
(2) Multimodal approaches usually perform better than their corresponding unimodal baselines. By comparing all multimodal and unimodal approaches, we can find that both global images and the visual information of local objects are valuable to MNER. This confirms that visual information can bring a wealth of external knowledge to the text. However, the improvement is not very significant, which demonstrates that MNER still has considerable room for progress through more effective multimodal approaches.
(3) Our ESPVR approach achieves state-of-the-art performance on the Twitter-2017 dataset and competitive results on Twitter-2015. To position the visual regions most relevant to the entities in the text, we design the ESPVR module. Compared with existing multimodal methods, our approach outperforms the state-of-the-art MAF by 0.27 on Twitter-2017 but performs slightly worse on Twitter-2015. This is because Twitter-2015 contains many unmatched text-image pairs, which is one direction for our future work.

Ablation Study
To show the effectiveness of each component in ESPVR, we conduct an ablation study by removing each particular component. w/o IC. This variant completely ignores the global information brought by image captions. Removing the image captions reduces performance, which shows that image captions add external support for each input word.
w/o ESPVR. This variant completely ignores the problem of fine-grained semantic correspondence between the semantic units in the text-image pair. When we remove the ESPVR module and solely train the main MNER task, there is a noticeable decline in overall recall and F1 scores, alongside a slight improvement in overall precision. The result is consistent with our hypothesis that visual regions can provide clues for fine-grained semantic interaction.
w/o IC + ESPVR. This variant completely ignores both global visual information and local visual information. We remove the image captions and the ESPVR module, resulting in significant degradation of the model's performance, indicating that both image captions and the ESPVR module are essential in our framework. Removing image captions has a slightly greater impact than removing the ESPVR module, probably because some images are not relevant to the text. Overall, the different components of our model work effectively with each other to produce better performance on the MNER task.

Further Analysis
To validate the effectiveness and robustness of our method, we conduct further analysis with three specific examples in Table 4.
For informal or incomplete text, if the corresponding visual information is provided, the visual context provides useful clues for the text. For example, in Table 4.A, the image's most obvious local visual information is the person, and all methods can obtain this local visual information with its salient features. Therefore, all the multimodal approaches can correctly classify the entity types after incorporating the image.
It is essential to obtain local visual information from the image that is relevant to the entities in the text. If the obtained local visual information is redundant or insufficient, it may result in misidentifying a non-entity as an entity or incorrectly predicting the entity category. For example, in Table 4.B, the text does not contain an entity, so normally no local visual information should be provided for it. However, the existing methods for obtaining local visual information all extract a large amount of visual information from this image, so those methods identify a non-entity as an entity. Another example is Table 4.C: UMT and UMGF follow the erroneous guidance of local visual information about the person in the image and omit the relationship between the persons, resulting in the identification of "newbalance" as "PER". On the contrary, MGCMT and our method can accurately determine the entity. Here A can obtain the correct result because it uses the local visual information obtained in both ways, thus refining the local visual information.

Conclusion
In this paper, we present an Entity Spans Position Visual Regions (ESPVR) module, which obtains the visual regions most relevant to the entities in the text as fine-grained local visual information. The experimental results reveal the superiority of the fine-grained local visual information acquired through this method: it is more advantageous for enhancing the performance of Multimodal Named Entity Recognition (MNER) than attention-based and object-detection-based methods.

Limitations
Although our experiments demonstrate the effectiveness of our method, there are still some limitations that can be improved in future work. First, data augmentation is necessary to enhance data efficiency in deep learning, and our model lacks multimodal data augmentation. Second, there are many irrelevant text-image pairs in MNER data, while our method focuses on the problem of acquiring local visual information. By filtering out text-irrelevant images before obtaining local visual information and focusing solely on text-relevant images when acquiring local visual regions, the efficacy of our proposed method is likely to be further amplified. We hope that the insights from this work will stimulate further research on MNER.

Fig. 1
Fig. 1 illustrates the overall architecture of the ESPVR, which consists of four major modules: (1) a Feature Extraction module; (2) an Entity Spans Position Visual Regions (ESPVR) module; (3) a Multimodal Interaction module; and (4) a CRF Decoding module. We first obtain word representations and visual representations, respectively. Then, to obtain local visual information, we deploy the ESPVR module to position the visual regions most relevant to the entities in the text. Next, the Multimodal Interaction module fully captures the cross-modal semantic interaction between the textual hidden representations and the visual region hidden representations. Finally, the CRF Decoding module assigns an entity label to each word in the input sequence, leveraging the hidden representations obtained from the Multimodal Interaction module.
where x_i is a word of the text, x′_i is a word of the image caption, and [CLS] and [SEP] are special tokens of BERT. Lastly, the text X′ is fed into BERT to obtain the contextualized word representations C = {c_0, c_1, c_2, ..., c_{n+1}}, where c_i ∈ R^d is the generated contextualized representation for x_i and d stands for the dimension of the word embedding. Visual Representations: To obtain better visual representations, we use the pre-trained model Swin Transformer (Liu et al., 2021). Specifically, given an image V, the visual representations B = {b_0, b_1, ..., b_144} are obtained by extracting the output of the last layer of the Swin Transformer, where b_0 represents the feature of the whole image, usually used for image classification, and {b_1, ..., b_144} are the 12 × 12 = 144 visual region features divided by the Swin Transformer, each represented by a 1536-dimensional vector.
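The cross-modal input construction described above is straightforward to sketch at the token level (the actual pipeline would use BERT's tokenizer, which handles these special tokens itself):

```python
def build_crossmodal_input(text_tokens, caption_tokens):
    """Concatenate the sentence and its image caption into one BERT input
    sequence: X' = [CLS] text [SEP] caption [SEP]."""
    return ["[CLS]"] + text_tokens + ["[SEP]"] + caption_tokens + ["[SEP]"]
```

The first [SEP] separates the text from the caption so BERT's attention can still carry caption context into the word representations.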

Table 2 :
Performance comparison in the cross-task scenario.
Table 3 shows comparison results between the full model and its ablation methods.

Table 4 :
The second row shows a few representative samples from the test set and their manually labeled entities.The bottom four rows show the prediction results of different methods for these test samples.