Entity-level Interaction via Heterogeneous Graph for Multimodal Named Entity Recognition



Introduction
Multimodal Named Entity Recognition (MNER) aims at combining textual and visual content to detect and classify named entities in multimodal social media posts (e.g., tweets). Different from traditional NER (Torisawa et al., 2007; Lample et al., 2016; Ma and Hovy, 2016), which focuses on formal single-modal texts, MNER confronts two specific challenges: 1) how to capture useful entity-related visual information; 2) how to alleviate the interference of visual noise. As shown in Figure 1, the MISC entity "Oscars" appearing in the tweet may be wrongly recognized as PER, since it can refer to both a person name and the movie award. But the trophies in the image help figure out that "Oscars" actually indicates the latter. Effectively capturing entity-related information from the image is therefore essential and challenging. Though helpful, incorporating images may also interfere with non-entity tokens that have no corresponding visual information, making them easily misidentified as entities. Effectively alleviating the interference brought by images is thus also a critical challenge.
Recent works on MNER have made progress by either improving cross-modal interaction mechanisms (Zhang et al., 2018; Lu et al., 2018; Yu et al., 2020) or seeking better visual features (Wu et al., 2020; Chen et al., 2020; Zhang et al., 2021). However, existing methods neglect the integrity of entity semantics and directly interact all textual tokens with visual features, which we regard as token-level interaction. Though straightforward, token-level interaction fails to use integral entity semantics to capture related visual information, and makes non-entity tokens easily interfered with by visual noise.
Thus, in this paper, we propose an end-to-end heterogeneous Graph-based Entity-level Interacting model (GEI) for MNER. As shown in Figure 1, the key insight of entity-level interaction is to obtain entity representations that query related visual information from object features, which has several benefits: 1) entity representations carry integral entity semantics, which can capture entity-related information effectively; 2) interacting visual features with only entity representations instead of all token representations protects non-entity tokens from the interference brought by images. In detail, GEI first introduces a span detection subtask to obtain entity representations, which serve as the bridge between the two modalities. Then, a multimodal heterogeneous graph is constructed with token, entity, and object nodes, whose semantic relationships are modeled by four kinds of edges. After that, GEI interacts entity nodes with object nodes to capture related visual information and fuses it into the token nodes connected with entity nodes. Finally, a CRF layer is employed to decode named entities from the object-aware token representations.
Overall, our contributions are as follows: 1) We propose a novel end-to-end model, GEI, for MNER. GEI interacts entity representations with visual objects to capture useful entity-related visual information, and excludes non-entity tokens from the interaction to rid them of visual noise.
2) We conduct experiments on two widely used datasets, Twitter-2015 (Zhang et al., 2018) and Twitter-2017 (Lu et al., 2018). The results demonstrate that GEI tackles both MNER challenges effectively.

Methodology
Figure 2 shows the architecture of GEI, which contains the following components: an Entity Representation Extractor (ERE), an Object Feature Encoder (OFE), a Heterogeneous Graph Interacting Network (HGIN), and a CRF Decoding module.

Entity Representation Extractor
Given an input sentence X = {x_1, ..., x_|X|}, where x_i is the i-th token and |X| is the max sequence length, we employ BERT (Devlin et al., 2018) as our text encoder and obtain contextualized token embeddings C = {c_1, ..., c_|X|}. Then, we use a Transformer layer (Vaswani et al., 2017) to gain the hidden representation of each token:

H = Transformer(C).    (1)

After that, we project the token representations into the multimodal space via a linear transformation: T = W_T H + b_T.

We introduce a span detection subtask to construct entity representations, which are used to capture entity-related visual information and serve as the bridge between the two modalities in HGIN. Firstly, we feed C to another Transformer layer to obtain hidden representations specific to this subtask:

H_sd = Transformer(C).    (2)

Then, we use a CRF (Lafferty et al., 2001) layer to recognize possible entity spans {(s_i, e_i)}_{i=1}^{|E|}, where |E| is the number of entities, and s_i and e_i are the start and end indexes of the i-th entity. The span detection loss L_sd is the CRF negative log-likelihood:

L_sd = -log [ ∏_i φ_i(z_{i-1}, z_i, X) / Σ_{z'} ∏_i φ_i(z'_{i-1}, z'_i, X) ],    (3)

where φ_i(z_{i-1}, z_i, X) and φ_i(z'_{i-1}, z'_i, X) are potential functions. Finally, we obtain the representation of each entity by max-pooling its constituent token representations:

E_i = MaxPooling(t_{s_i}, ..., t_{e_i}).    (4)
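As an illustration of the span-to-entity step, the max-pooling over each detected span's constituent token representations can be sketched in a few lines of Python (the function name and toy dimensions are ours, not the authors' implementation):

```python
def entity_representations(tokens, spans):
    """tokens: list of d-dimensional token vectors (T in the paper);
    spans: list of (start, end) inclusive indexes from span detection."""
    entities = []
    for s, e in spans:
        span_vecs = tokens[s:e + 1]
        dim = len(span_vecs[0])
        # element-wise max over the constituent token representations
        entities.append([max(v[j] for v in span_vecs) for j in range(dim)])
    return entities

# one detected entity covering tokens 0-1
T = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]
E = entity_representations(T, [(0, 1)])
```

Here the single detected span (0, 1) yields the element-wise maximum of the first two token vectors.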

Object Feature Encoder
We propose the OFE module to encode the input image into visual features. Considering that visual objects have a semantic granularity similar to that of entities, we acquire object representations as our image features. Given an input image I, we first use the object detection algorithm DETR (Carion et al., 2020) to detect the bounding boxes of visual objects O = {o_1, ..., o_|O|}, where |O| is the number of detected objects. Then, we concatenate I and O, feed them to a 152-layer ResNet (He et al., 2016), and take the output of the last pooling layer as the visual features:

V = {v_1, ..., v_{|O|+1}} = ResNet([I; O]).    (5)

After that, we project the visual features into the multimodal space via a multi-layer perceptron with a ReLU activation function.
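The projection at the end of OFE can be sketched as follows (a minimal pure-Python illustration with toy 2-dimensional weights; the actual hidden sizes and parameters are learned and not specified here):

```python
def mlp_project(v, W1, b1, W2, b2):
    """Project a visual feature v into the multimodal space:
    out = W2 * ReLU(W1 * v + b1) + b2."""
    # hidden layer with ReLU activation
    h = [max(0.0, sum(W1[i][j] * v[j] for j in range(len(v))) + b1[i])
         for i in range(len(b1))]
    # linear output layer
    return [sum(W2[i][j] * h[j] for j in range(len(h))) + b2[i]
            for i in range(len(b2))]

# toy 2-dim example with identity weights: ReLU zeroes out the negative component
v = [1.0, -2.0]
out = mlp_project(v, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                  [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```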

Heterogeneous Graph Interacting Network
We design HGIN module to capture entity-related visual information via entity-level interaction, and fuse it to entity-associated token representations.
Graph Construction. As shown in Figure 2, the multimodal heterogeneous graph G contains three kinds of nodes: textual token nodes N^T_i = t_i, entity nodes N^E_i = E_i, and visual object nodes N^V_i = v_i. We introduce the following kinds of edges for G: 1) Entity-Object Edge: every N^E_i and N^V_j are fully connected, to capture entity-related visual information. 2) Entity-Token Edge: N^E_i is connected with its associated token nodes {N^T_j}_{j=s_i}^{e_i}, to enhance them with entity-related visual information. 3) Intra-modal Edge: to capture intra-modal interactions, all token nodes {N^T_i}_{i=1}^{|X|} are fully connected with each other, and so are all object nodes.
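Under the assumptions above, the edge construction can be sketched as follows (node types are abbreviated as "T", "E", "V"; representing edges as tuples is our illustrative choice, not the authors' data structure):

```python
def build_edges(num_tokens, entity_spans, num_objects):
    """Build the edge set of the heterogeneous graph G."""
    edges = set()
    num_entities = len(entity_spans)
    # 1) Entity-Object edges: entities and objects are fully connected
    for i in range(num_entities):
        for j in range(num_objects):
            edges.add(("E", i, "V", j))
    # 2) Entity-Token edges: entity i connects to its constituent tokens
    for i, (s, e) in enumerate(entity_spans):
        for j in range(s, e + 1):
            edges.add(("E", i, "T", j))
    # 3) Intra-modal edges: tokens fully connected; objects fully connected
    for a in range(num_tokens):
        for b in range(num_tokens):
            if a != b:
                edges.add(("T", a, "T", b))
    for a in range(num_objects):
        for b in range(num_objects):
            if a != b:
                edges.add(("V", a, "V", b))
    return edges

# 3 tokens, one entity spanning tokens 0-1, 2 detected objects
edges = build_edges(3, [(0, 1)], 2)
```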
Cross-modal Interaction. Firstly, we employ multi-head self-attention (MHSA) over the intra-modal edges to exploit the context of the same modality:

H_m^(l) = MHSA(H_m^(l-1)), m ∈ {T, V},

where H_{m,i}^(l) is the hidden feature of node N^m_i at the l-th layer. Then, we interact entity nodes with object nodes via a gated cross-attention module:

M_E^(l) = g^(l) ⊙ CrossAtt(H_E^(l), H_V^(l)),

where M_E^(l) are the object-aware entity representations and g^(l) is a learned gate. Similarly, we obtain the entity-aware object representations M_V^(l). After that, we fuse the visual information from M_E^(l) into its associated token nodes: with attention weights α^(l) over the entity-token edges, we aggregate a visual feature D^(l) for each token and use it to update H_T^(l).
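A minimal sketch of the entity-to-object attention step (single-head dot-product attention with a fixed scalar gate; in the model the gate and projections are learned and multi-head attention is used, so this is only an illustrative simplification):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_cross_attention(entities, objects, gate=0.5):
    """Each entity attends over all objects (dot-product attention);
    a scalar gate controls how much visual context flows into the
    object-aware entity representation."""
    out = []
    for q in entities:
        # attention scores of the entity query against every object
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in objects]
        att = softmax(scores)
        # attention-weighted visual context vector
        ctx = [sum(att[j] * objects[j][d] for j in range(len(objects)))
               for d in range(len(q))]
        # gated mix of visual context and the original entity vector
        out.append([gate * c + (1 - gate) * qd for c, qd in zip(ctx, q)])
    return out
```

With a single object the attention weight is 1, so the output is simply the gated interpolation between the entity vector and that object's feature.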
After fusing the entity-related visual information into the corresponding token representations, we apply a CRF layer to conduct sequence labeling and obtain the entity recognition loss L_mner:

L_mner = -log [ ∏_i φ_i(y_{i-1}, y_i, X) / Σ_{y'} ∏_i φ_i(y'_{i-1}, y'_i, X) ],

where φ_i(y_{i-1}, y_i, X) and φ_i(y'_{i-1}, y'_i, X) are potential functions. During training, we sum the two losses as the final loss: L = λ_1 L_sd + λ_2 L_mner, where λ_1 and λ_2 are hyperparameters.
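Both L_sd and L_mner are CRF negative log-likelihoods; the computation can be sketched with a tiny forward-algorithm implementation (illustrative, not the authors' code):

```python
import math

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of a tag sequence under a linear-chain CRF.
    emissions[t][y]: emission score for tag y at position t;
    transitions[a][b]: score of moving from tag a to tag b."""
    n, k = len(emissions), len(emissions[0])
    # score of the gold path
    gold = emissions[0][tags[0]]
    for t in range(1, n):
        gold += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # log-partition over all paths via the forward algorithm
    alpha = list(emissions[0])
    for t in range(1, n):
        alpha = [math.log(sum(math.exp(alpha[a] + transitions[a][y])
                              for a in range(k))) + emissions[t][y]
                 for y in range(k)]
    log_z = math.log(sum(math.exp(a) for a in alpha))
    return log_z - gold

# with all-zero scores every path is equally likely: NLL = log(k^n) = log 4
nll = crf_nll([[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [0, 0])
```

The final training loss is then a weighted sum of two such terms, lambda1 * nll_sd + lambda2 * nll_mner.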
Experiment Settings

Baselines and Metrics. For a thorough comparison, we compare our approach with two groups of baseline models. The first group contains representative text-based NER approaches: 1) CNN-BiLSTM-CRF (Ma and Hovy, 2016), a classical text-based neural network for NER that uses both word-level and character-level information. 2) HBiLSTM-CRF (Lample et al., 2016), an improvement of CNN-BiLSTM-CRF, replacing
the bottom CNN layer with an LSTM to build a hierarchical structure. 3) BERT (Devlin et al., 2018), a competitive baseline for NER with a multi-layer bidirectional Transformer encoder followed by a softmax layer for entity prediction. 4) BERT-CRF, a variant of BERT that replaces the softmax layer with a CRF layer. The second group contains several competitive multimodal approaches for MNER: 5) VG (Lu et al., 2018), which utilizes visual attention and a gate mechanism to exploit implicit information from the whole image to guide word representation learning, based on HBiLSTM-CRF. 6) ACoA (Zhang et al., 2018), which designs an adaptive co-attention network to learn word-aware visual representations and vision-aware word representations, based on CNN-BiLSTM-CRF. 7) UMT (Yu et al., 2020), which extends the Transformer to a multimodal version and incorporates an auxiliary entity span detection module. 8) Object-AGBAN (Zheng et al., 2020), which proposes an adversarial bilinear attention network to capture the correlations between visual objects and textual entities. 9) OCSGA (Wu et al., 2020), which combines a dense co-attention network (self-attention and guided attention) to model the correlations between visual objects and textual entities. 10) UMGF (Zhang et al., 2021), which proposes a unified multimodal graph fusion approach for MNER and achieves the current SOTA on Twitter-2015. 11) Captions (Chen et al., 2020), which uses image captions as visual features and achieves the current SOTA on Twitter-2017. Following previous works, we take the micro F1-score as the evaluation metric.
Implementation Details. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 3e-5 and set the batch size to 16. The number of GNN layers is set to 6, λ_1 and λ_2 are both set to 0.5, and the dropout rate is set to 0.4. The number of heads in multi-head attention is set to 8. For all experiments, we train and test our model on a Tesla V100 GPU, and report the average F1-score over three runs as the final result. To alleviate the error propagation caused by the gap between training and prediction, we adopt the scheduled sampling strategy (Bengio et al., 2015): during training, GEI gradually switches the span detection results from the gold labels to the model's own predictions. From epoch 2 to epoch 6, GEI linearly increases the proportion of predicted span detection results from 0% to 90%.
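The linear scheduled-sampling ramp can be sketched as a small helper (the function name and the treatment of epochs outside the stated range are our assumptions; the paper only specifies 0% to 90% over epochs 2 to 6):

```python
def predicted_span_proportion(epoch, start=2, end=6, max_prop=0.9):
    """Scheduled-sampling ratio: before `start` only gold spans are used;
    the share of model-predicted spans then rises linearly to `max_prop`
    at `end` and stays there afterwards."""
    if epoch < start:
        return 0.0
    if epoch >= end:
        return max_prop
    return max_prop * (epoch - start) / (end - start)
```

For example, at epoch 4 (halfway through the ramp) 45% of the span detection results fed to HGIN would come from the model's own predictions.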

Results and Analysis
Table 1 shows the main results of GEI compared with the baseline models on both Twitter-2015 and Twitter-2017. Our proposed GEI significantly outperforms UMGF by 1.93% and 1.18% F1-score on Twitter-2017 and Twitter-2015, respectively. Furthermore, GEI surpasses Captions by 2.94% F1-score on Twitter-2015 and achieves competitive performance on Twitter-2017. Besides, GEI outperforms all baseline models that also use visual objects, which suggests that conducting cross-modal interaction at the entity level can effectively exploit useful visual information from object features.
Ablation Study. To further investigate the effectiveness of entity-level interaction, we conduct ablation experiments with three variants: 1) -Entity-level Interaction, which removes entity nodes from the multimodal graph and directly interacts object nodes with entity-associated token nodes. 2) -Span Detection, which further removes the span detection subtask and interacts all token nodes with object nodes.
3) +Image Region, which replaces visual object features with fixed region features, following Yu et al. (2020).
From Table 1, we observe that: 1) Employing entity representations that carry integral entity semantics to capture entity-related visual information is important, contributing +0.65% / +0.50% F1-score. 2) Excluding non-entity tokens from the cross-modal interaction to alleviate visual interference is essential and improves performance significantly. 3) Compared with fixed image regions, employing visual objects that have a semantic granularity similar to that of entities is preferable, enhancing the F1-score by +0.44% / +0.91%.
Case Study. Figure 3 shows two representative examples that intuitively demonstrate the effectiveness of our method. 1) For the left example, UMT and UMGF misidentify the PER entity "Leonardo" as MISC, while GEI extracts both entities correctly. This shows that GEI effectively captures entity-related visual information via entity-level interaction (i.e., associating "Leonardo" with the people appearing in the image). 2) For the right example, both UMT and UMGF suffer from the interference brought by the image and mislabel the non-entity token "HURRY" as a PER entity. In contrast, by excluding non-entity tokens from the cross-modal interaction, GEI rids "HURRY" of the visual noise and predicts it correctly. This phenomenon indicates that our framework alleviates the interference brought by images via entity-level interaction.
Visualization of Entity-level Interaction. To gain insight into the interaction between entities and visual objects, we visualize the cross-modal attention weights between entity nodes and visual object nodes for the example in Figure 1. As shown in Figure 4, the two PER entities "Attenborough" and "Ben Kingsley" have clearly greater weights on the two person objects than on the other visual objects during cross-modal interaction. The same phenomenon holds between the MISC entity "Oscars" and the two trophy objects. These findings confirm that our framework can effectively capture entity-related visual information through entity-level cross-modal interaction.

Conclusion
In this paper, we propose a heterogeneous Graph-based Entity-level Interacting model (GEI) for MNER. GEI interacts entity representations with visual objects to capture useful entity-related visual information, and excludes non-entity tokens from the interaction to rid them of visual noise. Experiments on two public MNER datasets demonstrate the effectiveness of our method.

Limitations
MNER methods have made impressive progress on multimodal social media NER by incorporating complementary visual features. Though helpful in many cases, incorporating images may also bring diverse interference to the task. In this paper, we focus on the interference suffered by non-entity tokens and alleviate it by excluding non-entity tokens from the cross-modal interaction process. However, existing methods (including our GEI) still face inevitable interference when the image is irrelevant to the text or carries ironic meaning. How to effectively alleviate such interference remains to be studied in future work.

Figure 1 :
Figure 1: An example from the public social media MNER dataset Twitter-2015, and the difference between previous and proposed cross-modal interacting methods.

Figure 2 :
Figure 2: The overall architecture of GEI.

Figure 3 :
Figure 3: Case study of our proposed GEI and the previous SOTA methods UMT and UMGF. The bottom three rows are the entities predicted by the different approaches.

Table 1 :
Performance comparison on the two MNER datasets. The result marked with * is obtained via the released code of Zhang et al. (2021). The boldface and underlined numbers are the best two results in each column. † indicates significance with p-values < 0.05 compared with the unimodal baselines.