Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) are fundamental and crucial tasks in information extraction. However, existing approaches for MNER and MRE usually suffer from error sensitivity when irrelevant object images are incorporated into texts. To deal with these issues, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction, aiming to achieve more effective and robust performance. Specifically, we regard visual representations as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions. We further propose a dynamic gated aggregation strategy to obtain hierarchical multi-scaled visual features as the visual prefix for fusion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance. Code is available at https://github.com/zjunlp/HVPNeT.


Introduction
Named entity recognition (NER) and relation extraction (RE) are important tasks in information extraction and knowledge base population, due to their research significance in natural language processing (NLP) and their wide applications (Hosseini, 2019; Qin et al., 2021; Zhang et al., 2021c). Currently, with the rapid development of multimodal learning, multimodal NER (MNER) and multimodal RE (MRE) methods (Moon et al., 2018; Zheng et al., 2021) have been proposed to enhance linguistic representations with the aid of visual clues from images. They significantly extend text-based models by taking images as additional inputs, since the visual context helps to resolve ambiguous multi-sense words. The essence of MNER and MRE is learning good visual features and incorporating them into textual representations to enhance NER and RE. Early methods (Moon et al., 2018) study how to incorporate the feature of the whole image into the textual representation. Zheng et al. (2021) further validate that object-level visual fusion is more specific and important for MNER and MRE. Recently, RpBERT proposes to train a classifier of whether the "Image adds to the tweet meaning" before MNER tasks. However, it heavily relies on pre-training over a large extra annotated corpus of image-text relevance and only focuses on the whole image, ignoring the bias of relevant object-level visual fusion. In practice, irrelevant objects may directly exert negative effects on text inference. Meanwhile, it is not trivial to acquire absolutely relevant object-level visual information to enhance the text. Thus, an effective method should be derived to learn better visual representations and alleviate the error sensitivity to irrelevant object images for social media NER and RE tasks.
Considering that images often appear before the text in a web document, we argue that images can be regarded as a prefix for their textual descriptions, inspired by prompt learning in language models (Li and Liang, 2021; Liang et al., 2022; Zhang et al., 2021d). Specifically, given an image-text pair, we prepend an object-level image feature sequence of length V_i (the visual prefix) to the text sequence at each self-attention layer of BERT (Devlin et al., 2019). Note that the visual prefix is a pluggable operation and does not require any annotation of relevance. Therefore, the visual prefix can not only introduce object-level visual signals, but also reduce the impact on the architecture representing the text. Intuitively, the visual prefix, regarded as a prompt for the text, may help alleviate the error sensitivity to irrelevant object images.
Convolutional Neural Networks (CNNs) capture multi-scale information through a pyramidal feature hierarchy (Ren et al., 2015) from low to high levels, and BERT encodes a rich hierarchy of linguistic information (Jawahar et al., 2019) from the bottom to the top. Inspired by the observation of Lin et al. (2017) that objects of different sizes can have appropriate feature representations at the corresponding scales, we propose to make each layer of BERT aware of hierarchical multi-scale visual features so as to make a more enlightened and comprehensive forecasting decision.
To this end, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction. Specifically, inspired by SimVLM, we propose a visual prefix-guided fusion mechanism that concatenates object-level visual representations as the prefix of each self-attention layer in BERT, yielding a softer and more robust attention module for visual-enhanced NER and RE. We further design a dynamic gate for each layer to generate image-dependent paths, so that a variety of aggregated hierarchical multi-scaled visual features can be used as the visual prefix to enhance NER and RE. Overall, we summarize the major contributions of our paper as follows: • We present a hierarchical visual prefix fusion network for MNER and MRE, incorporating hierarchical multi-scaled visual features through a visual prefix-based attention mechanism at each self-attention layer of BERT to generate effective and robust textual representations that reduce error sensitivity.
• We exploit dynamic gates to fully leverage the hierarchical visual features. Thus, the textual representation of each layer in the Transformer can adaptively attend to the corresponding hierarchical visual features. To the best of our knowledge, this is the first work to leverage hierarchical pyramidal visual features for multimodal learning.
• We evaluate our method on MNER and MRE tasks. Our experimental results on three benchmark datasets validate the effectiveness and superiority of our HVPNeT.

Related Work

Multimodal Entity and Relation Extraction

As crucial components of information extraction, named entity recognition (NER) and relation extraction (RE) have attracted much attention in the research community (Liu et al., 2019; Chen et al., 2021b,a). Previous studies typically focus on the textual modality and standard text. As multimodal data become increasingly popular on social media platforms, such text-only research is limited. Recently, several studies have focused on the MNER and MRE tasks, aiming to utilize the associated images to better recognize named entities and their relations. In the early stages, Moon et al. (2018) and Arshad et al. (2019) propose to encode the text with an RNN and the whole image with a CNN, then design implicit interactions to model information between the two modalities for multimodal NER. More recently, several works propose to leverage regional image features that represent objects in the image, exploiting fine-grained semantic correspondences based on Transformers and visual backbones.
While most current methods ignore the error sensitivity, one exception is RpBERT, which proposes to learn a text-image relation classifier to enhance multimodal BERT and reduce the interference from irrelevant images, while requiring extensive annotation of the relevance of image-text pairs.

Pre-trained Multimodal Representation
Pre-trained multimodal BERT models have recently achieved significant improvements on many multimodal tasks (e.g., visual question answering). The existing visual-linguistic BERT models can be divided along two dimensions: 1) Architecture. Single-stream structures include Unicoder-VL, VisualBERT (Li et al., 2019), VL-BERT (Su et al., 2020), and UNITER (Chen et al., 2020b), where the text tokens and images are combined into one sequence and fed into BERT to learn contextual embeddings. Two-stream structures, such as LXMERT (Tan and Bansal, 2019) and ViLBERT (Lu et al., 2019), process the visual and language inputs in two separate streams that interact through cross-modality or co-attentional Transformer layers. 2) Pre-training tasks. The pre-training tasks of multimodal vision-language models mainly consist of masked language modeling (MLM), masked region classification (MRC), and image-text matching (ITM). However, most previous models are pre-trained on datasets for image captioning (Sharma et al., 2018) or visual question answering, where multimodal interactions are required. Applying current vision-language models to the MNER and MRE tasks may not yield good performance, since MNER and MRE mainly focus on leveraging visual information to enhance the text rather than making predictions on the image side.

Methodology
As illustrated in Figure 2, we present a novel hierarchical visual prefix fusion network for multimodal entity and relation extraction. Note that our method can also be applied to other visual-enhanced tasks over text.

Collection of Pyramidal Visual Feature
On the one hand, the image associated with a sentence contains several visual objects related to the entities in the sentence, providing additional semantic knowledge to assist information extraction. On the other hand, the global image features may express abstract concepts, which play the role of a weak learning signal. Thus, we collect multiple visual clues for multimodal entity and relation extraction, taking the regional images as the vital information and the global image as a supplement.
Given an image, we follow previous work in adopting the visual grounding toolkit (Yang et al., 2019) to extract the top-m salient local visual objects. Then, we rescale the global image and the object images to 224 × 224 pixels, obtaining the global image I and the visual objects O = {o_1, o_2, ..., o_m}.
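As a concrete illustration, the rescaling step can be sketched in PyTorch using `F.interpolate`; the random tensors below are stand-ins for real crops returned by the grounding toolkit, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def prepare_visual_inputs(global_image, object_crops, size=224):
    """Rescale the global image and each object crop to size x size pixels.
    global_image: (3, H, W) tensor; object_crops: list of (3, h, w) tensors."""
    def rescale(img):
        return F.interpolate(img.unsqueeze(0), size=(size, size),
                             mode="bilinear", align_corners=False).squeeze(0)
    I = rescale(global_image)
    O = [rescale(o) for o in object_crops]
    return I, O

# toy example: random tensors stand in for the global image and m = 3 object crops
I, O = prepare_visual_inputs(torch.rand(3, 480, 640),
                             [torch.rand(3, 60, 80) for _ in range(3)])
```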
In the area of CV, feature fusion methods that leverage features from different blocks of pre-trained models (Kim et al., 2018; Lin et al., 2017) are widely applied to improve model performance. Inspired by such practices, we take the first step toward applying pyramid features in the multimodal area. We propose to fuse hierarchical image features into each Transformer layer; thus, leveraging a feature pyramid is essential. Typically, given an image, we encode it with a backbone model and generate a list of pyramidal feature maps {F_1, F_2, F_3, ..., F_c} with different scales, then map them with M_θ(·) as follows:

V_i = M_θ(F_i) = Conv_{1×1}(Pool(F_i)),   (2)

where i denotes the i-th block of the backbone model, c denotes the number of blocks in the visual backbone model (4 for ResNet), and Pool represents the pooling operation that aggregates the features to the same spatial size. The 1×1 convolutional layer is leveraged to map the pyramidal visual features to match the embedding size of the Transformer.
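A minimal PyTorch sketch of this mapping M_θ(·) might look as follows; the channel sizes are the standard ResNet50 block outputs, while the pooled spatial size and embedding dimension are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PyramidMapper(nn.Module):
    """M_theta: pool each pyramidal feature map F_i to a common spatial size,
    then apply a 1x1 convolution to match the Transformer embedding size."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), d_model=768, spatial=2):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(spatial) for _ in in_channels)
        self.projs = nn.ModuleList(nn.Conv2d(ch, d_model, kernel_size=1)
                                   for ch in in_channels)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from the c = 4 backbone blocks
        return [proj(pool(f)) for pool, proj, f in zip(self.pools, self.projs, feats)]

# ResNet50-like feature map shapes for a 224x224 input
mapper = PyramidMapper()
feats = [torch.rand(1, c, s, s) for c, s in zip((256, 512, 1024, 2048), (56, 28, 14, 7))]
out_feats = mapper(feats)
```

Each mapped feature V_i then has the same channel and spatial dimensions, regardless of which backbone block it came from.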

Dynamic Gated Aggregation
Although objects of different sizes can have appropriate feature representations at the corresponding scales, it is not trivial to decide which block of the visual backbone should provide the visual prefix for each layer of the Transformer. To address this challenge, we propose constructing a densely connected routing space, where hierarchical multi-scaled visual features are connected to every Transformer layer.

Dynamic Gate Module
We conduct the routing process through a dynamic gate module, which can be viewed as a procedure of path decision. The motivation of the dynamic gate is to predict a normalized vector that represents how much of the visual feature of each block to use. In the dynamic gate, g_i^{(l)} ∈ [0, 1] denotes the path probability from the i-th block of the visual backbone to the l-th layer of the Transformer. It is calculated as g^{(l)} = G^{(l)}(V_1, ..., V_c), where G^{(l)} denotes the gating function for the l-th layer of the Transformer and c represents the number of blocks in the backbone. We first produce the logits α_i^{(l)} of the gate signals:

α^{(l)} = W_l · f( Σ_{i=1}^{c} P(V_i) ),

where f(·) denotes the Leaky_ReLU activation function and P represents the global average pooling layer. We first squeeze the input features V_i, with a shape of (d_i, h_i, w_i), from the i-th block by an average pooling operation. Then we add the features from multiple blocks to generate the average vector. We further reduce the feature dimension to c with the MLP layer W_l and adopt a soft gate that generates continuous values as path probabilities. Afterward, we generate the probability vector g^{(l)} for the l-th layer of the Transformer as follows:

g^{(l)} = softmax(α^{(l)}).
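The gate computation described above can be sketched as follows (a simplified, hypothetical implementation; the exact layer shapes in the released code may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGate(nn.Module):
    """Predicts path probabilities g^(l) over the c backbone blocks for one layer."""
    def __init__(self, d_model=768, c=4):
        super().__init__()
        self.w_l = nn.Linear(d_model, c)  # MLP layer W_l reducing the dimension to c

    def forward(self, feats):
        # feats: list of c tensors (B, d_model, h, w) from the mapped blocks
        # squeeze each block with global average pooling P, then sum over blocks
        squeezed = sum(F.adaptive_avg_pool2d(f, 1).flatten(1) for f in feats)  # (B, d)
        logits = self.w_l(F.leaky_relu(squeezed))  # alpha^(l): (B, c)
        return logits.softmax(dim=-1)              # g^(l): soft path probabilities

gate = DynamicGate()
g = gate([torch.rand(1, 768, 2, 2) for _ in range(4)])
```

The softmax makes the c path probabilities sum to one, so the gate softly distributes each Transformer layer's attention over the backbone blocks rather than making a hard choice.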

Aggregated Hierarchical Feature
Based on the above dynamic gate g^{(l)}, we can derive the final aggregated hierarchical visual feature V_gated^{(l)} to match the l-th layer of the Transformer as:

V_gated^{(l)} = Σ_{i=1}^{c} g_i^{(l)} · V_i.

Formally, the final visual feature Ṽ_gated^{(l)} corresponding to the l-th layer of the Transformer is obtained by concatenating the aggregated features of the global image and the m object images; it will be adopted to enhance the layer-level representation of the textual modality through visual prefix-based attention.
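A possible sketch of this aggregation, under the assumption that the gated features of the global image and the m objects are flattened and concatenated along the sequence axis:

```python
import torch

def aggregate_visual_prefix(gate, per_image_feats):
    """gate: (c,) path probabilities g^(l); per_image_feats: one entry per image
    (global + m objects), each a list of c tensors (d, h, w) from the backbone blocks.
    Returns the concatenated visual prefix of length h*w*(m+1)."""
    seqs = []
    for feats in per_image_feats:
        v = sum(p * f for p, f in zip(gate, feats))   # gated sum over blocks: (d, h, w)
        seqs.append(v.flatten(1).transpose(0, 1))     # flatten to a sequence: (h*w, d)
    return torch.cat(seqs, dim=0)                     # (h*w*(m+1), d)

gate = torch.tensor([0.1, 0.2, 0.3, 0.4])
per_image = [[torch.rand(768, 2, 2) for _ in range(4)] for _ in range(4)]  # 1 global + 3 objects
prefix = aggregate_visual_prefix(gate, per_image)
```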

Visual Prefix-guided Fusion
We regard the hierarchical multi-scaled image feature as a visual prefix, and prepend the sequence of visual prefixes to the text sequence at each self-attention layer of BERT (Devlin et al., 2019). In particular, given an input sequence X = {x_1, x_2, ..., x_n}, the contextual representation H^{l-1} ∈ R^{n×d} is first projected into the query/key/value vectors:

Q = H^{l-1} W_q,  K = H^{l-1} W_k,  V = H^{l-1} W_v.

As for the aggregated hierarchical visual features Ṽ_gated^{(l)}, we use a set of linear transformations W_φ^l ∈ R^{d×2×d} for the l-th layer to project them into the same embedding space as the textual representation in the self-attention module. Besides, we define the visual prompt φ_k^l, φ_v^l ∈ R^{hw(m+1)×d} as:

[φ_k^l; φ_v^l] = Ṽ_gated^{(l)} W_φ^l,

where hw(m+1) represents the length of the visual sequence and m denotes the number of visual objects detected by the object detection algorithm. Formally, the visual prefix-based attention is calculated as follows:

Attn(H^{l-1}) = softmax( Q [φ_k^l; K]^T / √d ) [φ_v^l; V].

Intuitively, attending over the visual prefix in this soft manner is beneficial to reduce error sensitivity to irrelevant object elements.
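The prefix attention can be illustrated with a single-head sketch (the real model uses BERT's multi-head attention; names such as `w_phi` are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    """Single-head sketch of visual prefix-based attention: the visual prefix is
    projected into extra key/value pairs and prepended to the textual keys/values,
    while queries come only from the text."""
    def __init__(self, d=768):
        super().__init__()
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.w_phi = nn.Linear(d, 2 * d)  # W_phi: maps visual feats to [phi_k; phi_v]

    def forward(self, h_text, v_gated):
        # h_text: (n, d) textual states H^{l-1}; v_gated: (Lv, d) visual prefix
        phi_k, phi_v = self.w_phi(v_gated).chunk(2, dim=-1)
        K = torch.cat([phi_k, self.wk(h_text)], dim=0)   # (Lv + n, d)
        V = torch.cat([phi_v, self.wv(h_text)], dim=0)
        scores = self.wq(h_text) @ K.t() / (K.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ V             # (n, d)

layer = PrefixAttention()
out = layer(torch.rand(5, 768), torch.rand(16, 768))
```

Because the output keeps the textual sequence length, the module can be dropped into a BERT layer without otherwise changing its architecture, which is what makes the prefix pluggable.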

Classifier
Based on the above description, we obtain the final representation of BERT, H^L = U(X, Ṽ_gated), where U(·) denotes the operation of visual prefix-based attention. Finally, we apply different classifier layers for NER and RE, respectively.
Named Entity Recognition. Following Moon et al. (2018), we also adopt a CRF decoder to perform the NER task. Formally, we feed the final hidden vectors H^L of BERT to the CRF model. For a sequence of tags y = {y_1, ..., y_n}, the probability of the label sequence y and the objective of NER are defined as follows (Lample et al., 2016a):

p(y|X) = exp(S(X, y)) / Σ_{y'} exp(S(X, y')),   L_NER = −log p(y|X),

where the sum runs over all candidate tag sequences, Y represents the pre-defined label set with the BIO tagging schema, and S(·) represents the potential function. Details can be found in Lample et al. (2016a).
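For illustration, the CRF sequence probability can be computed with a minimal linear-chain forward algorithm (an unbatched sketch, not the paper's implementation):

```python
import torch

def crf_log_prob(emissions, transitions, tags):
    """log p(y|X) for a linear-chain CRF (unbatched sketch).
    emissions: (n, K) per-token label scores; transitions: (K, K); tags: n ints."""
    n, K = emissions.shape
    # potential S(X, y) of the given tag sequence
    score = emissions[0, tags[0]]
    for t in range(1, n):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # log partition over all tag sequences via the forward algorithm
    alpha = emissions[0]
    for t in range(1, n):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return score - torch.logsumexp(alpha, dim=0)

# toy check: n = 2 tokens, K = 2 labels
em, tr = torch.rand(2, 2), torch.rand(2, 2)
logp = crf_log_prob(em, tr, [0, 1])
```

Summing exp(log p) over all K^n tag sequences yields 1, which is a quick way to sanity-check the partition computation.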

Relation Extraction. An RE dataset can be denoted as D_re = {(X^{(i)}, r^{(i)})}_{i=1}^{M}; the goal of RE is to predict the relation r ∈ Y between the subject entity and the object entity. Specifically, a [CLS] head is utilized to compute the probability distribution over the class set Y with the softmax function p(r|X) = Softmax(W H^L_{[CLS]}), and the model parameters and W are fine-tuned by minimizing the cross-entropy loss over p(r|X) on the training set as follows:

L_RE = − Σ_{i=1}^{M} log p(r^{(i)} | X^{(i)}).
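A minimal sketch of the relation head (the number of relation classes here is an illustrative assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class REClassifier(nn.Module):
    """[CLS]-based relation head; 23 relation classes is an assumption for the sketch."""
    def __init__(self, d=768, num_relations=23):
        super().__init__()
        self.W = nn.Linear(d, num_relations)

    def forward(self, h_cls, labels=None):
        logits = self.W(h_cls)                      # (B, |Y|)
        if labels is None:
            return F.softmax(logits, dim=-1)        # p(r|X)
        return F.cross_entropy(logits, labels)      # cross-entropy training loss

head = REClassifier()
probs = head(torch.rand(2, 768))
loss = head(torch.rand(2, 768), torch.tensor([0, 1]))
```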

Experiments
In the following section, we conduct experiments to evaluate our method on two multimodal information extraction tasks, MNER and MRE. Specifically, we adopt ResNet50 (He et al., 2016) as the visual backbone and BERT-base (Devlin et al., 2019) as the textual encoder. Results on three datasets demonstrate that our HVPNeT outperforms a number of unimodal and multimodal approaches.

Datasets
We select three datasets for our experiments: Twitter-2015 and Twitter-2017 for MNER, and MNRE (Zheng et al., 2021) for MRE. Statistics of the datasets and experimental details are provided in Appendices A and B.

Compared Baselines
We compare our HVPNeT with several baseline models to demonstrate its superiority. The comparison mainly covers three groups of models: text-based models, previous SOTA MNER and MRE models, and variants of our model.
Text-based models: we first consider a group of representative text-based models for NER and RE. Multimodal models: we further compare with previous SOTA approaches: 4) UMGF, the newest SOTA for MNER, which proposes a unified multi-modal graph fusion approach for MNER. 5) BERT+SG, proposed in Zheng et al. (2021) for MRE, which concatenates the textual representation from BERT with visual features generated by the scene graph (SG) tool (Tang et al., 2020). 6) MEGA (Zheng et al., 2021), the newest SOTA for MRE, which develops a dual graph for multi-modal alignment to capture the correlation between entities and objects for better performance. 7) VisualBERT (Li et al., 2019): different from the above SOTA methods mainly based on co-attention, VisualBERT has a single-stream structure and serves as a strong baseline for comparison. The results of VisualBERT listed in our paper are taken from Chen et al.
Variants of Our Model: we set up ablation experiments to explore the effectiveness of our design. We use the same parameter settings as HVPNeT for each variant model for a fair comparison. HVPNeT-Flat: this variant of our model removes the pyramid structure. Here we assign the visual features from the output of the 4th block of ResNet and map them to each layer of BERT to conduct image-text fusion.
HVPNeT-1T3: As ResNet and BERT have four blocks and 12 layers, respectively, it is intuitive to directly map the visual features of one block to three layers of BERT. We denote this variant as HVPNeT-1T3 and compare it with our final version with hierarchical visual features.
HVPNeT-OnlyObj: Visual objects are considered fine-grained image representations. We conduct an ablation that adopts only the object-level features in this model to validate the effect of the object features.

Main Results
The experimental results of HVPNeT and all baselines on the three test sets are presented in Table 1. From the experimental results, we observe the following: Firstly, incorporating visual features is generally helpful for NER and RE tasks, as shown by comparing the SOTA multimodal approaches with their corresponding text-based baselines. Although previous multimodal approaches generally achieve better performance, the largest F1 improvement for NER is only about 2.0% (comparing UMGF with BERT-CRF), while for RE it is about 5.55% (comparing MEGA with MTB). This observation reveals that the performance improvement from images on text-based NER tasks is relatively limited compared with RE tasks.
Secondly, our method is superior to the newest SOTA models UMGF and MEGA, improving F1 scores by 1.36% and 15.44% on the Twitter-2017 and MNRE datasets, respectively. It is worth noting that most previous multimodal methods ignore the error sensitivity to irrelevant object-level images, while our method regards the hierarchical visual prefix as a prompt for text. These results indicate that our method can effectively alleviate the error sensitivity to irrelevant object images and is a more robust method for visual-enhanced NER and RE.
Finally, we also compare with VisualBERT, a pre-trained multimodal BERT with a single-stream structure. We notice that even as a pre-trained multimodal model, VisualBERT leaves much to be desired on MNER and MRE tasks, performing worse than UMGF and MEGA, let alone our method. We attribute this to the fact that its pre-training datasets and process are less relevant to information extraction tasks.

Table 2: The first row shows the split of the relevance of image-text pairs; the middle rows show representative samples together with their entity-object attention on the test set of the MNRE dataset (the y-axis represents the textual entities, and the x-axis denotes the visual objects with a length of 4 flattened patches); the bottom four rows show the relations predicted by different approaches on these test samples.

Low-resource Scenario
We further conduct experiments in low-resource settings by randomly sampling 5% to 50% of the original training set to form a low-resource training set. Figure 3 shows the performance of our method in the low-resource scenario compared with several baselines. By analyzing these results, we observe: 1) UMT and MEGA consistently outperform the compared baselines in the low-resource scenario; this improvement indicates that incorporating visual features is still helpful for NER and RE tasks in low-resource scenarios. 2) Moreover, HVPNeT still outperforms the other baselines, which further proves the effectiveness and data efficiency of our proposed method.

Cross-task Scenario

Table 3 shows the performance comparison of HVPNeT and UMGF in a cross-task scenario for versatility analysis. For the first part, Twitter-2017 → MNRE denotes that the model trained on Twitter-2017 is further trained and tested on MNRE. For the second part, MNRE → Twitter-2017 denotes that the model trained on MNRE is further trained and tested on Twitter-2017. From this table, we observe that our HVPNeT significantly outperforms UMGF by a considerable margin. Note that our method achieves further improvement in the cross-task scenario, while UMGF performs worse than its previous results on the corresponding dataset. This suggests that HVPNeT is robust enough to automatically reduce the interference of visual information from irrelevant pictures; thus, more image-text data may facilitate learning better parameters for modality fusion. Besides, it would also be interesting to extend our work to multi-task learning or multi-modal pre-training, which we leave for future work.

Detailed Model Analysis
Ablation Study. In this part, we conduct extensive experiments with the variants of our model to further analyze its effectiveness. We report the results of the variants in Table 1. We can observe that: (1) Visual Prefix-guided Fusion. The core module of our HVPNeT is visual prefix-guided fusion, which is a pluggable operation. Therefore, ablating visual prefix-guided fusion is equivalent to a purely BERT-based baseline model. As shown in Table 1, HVPNeT achieves significant improvements over the purely BERT-based baseline, revealing the effectiveness of the pluggable visual prefix-guided fusion.
(2) Hierarchical Visual Features. To validate the impact of our proposed hierarchical visual features, we carry out experiments with two variants: 1) HVPNeT-Flat, which crudely assigns a single visual feature to each layer of BERT; and 2) HVPNeT-1T3, which intuitively leverages visual features from low-level to high-level blocks. We observe that HVPNeT with hierarchical visual features consistently achieves the best performance compared with the other variants. Although HVPNeT-1T3 performs slightly below the dynamic-gate version, it still outperforms the crude variant HVPNeT-Flat. This reveals that the dynamic gate can automatically learn appropriate weights for multi-scaled visual representations, enabling the model to learn good visual guidance for multimodal entity and relation extraction.

Figure 4: Visualization of the dynamic gate learned on the MNER task. Each subgraph denotes one layer of BERT; the ordinate and abscissa respectively represent the instance id in a batch and the block id of ResNet.
(3) Visual Clues Term. As recent SOTA models such as UMT, UMGF, and MEGA all adopt visual objects to enhance textual representation, we conduct experiments ablating global images to explore the impact of the visual clues. As expected, we find that HVPNeT-OnlyObj performs slightly worse than HVPNeT, which is consistent with the observations of previous works. This can be attributed to the fact that abstract clues may not be associated with the text in information extraction tasks. Meanwhile, this empirical finding demonstrates the flexibility of our method in infusing visual clues of different granularity.
Case Analysis for Error Sensitivity. To validate the effectiveness and robustness of our method, we conduct a case analysis on image-text relevance, as shown in Table 2. We notice that VisualBERT, MEGA, and our method can all recognize the relation for the relevant image-text pair, and the attention between relevant entities and objects is significant. When the image expresses abstract semantics that are only weakly relevant to the text, only our method succeeds in prediction, because HVPNeT captures richer hierarchical features. It should be noted that the other two multimodal baselines fail on irrelevant image-text pairs, while text-based BERT and our method still predict correctly. These observations reveal that regarding the visual prefix as a prompt for text helps our model learn a more robust multimodal representation, which is essential for handling the noise of uncorrelated object images.
Gate Visualization. We argue that dynamic gated aggregation of hierarchical visual representations is another key component behind HVPNeT's superior performance. Specifically, dynamic gated aggregation can adaptively assign different modality-integration paths to different input images, thus incorporating visual guidance with hierarchical multi-scaled information. To this end, we randomly sample eight images in a batch and visualize their gate vectors learned by HVPNeT across the 12 layers of BERT in Figure 4. Note that the optimized gate vectors follow the trend of matching low-level textual semantics with low-level visual semantics, and high-level textual semantics with high-level visual semantics. Meanwhile, the modality fusion obtained by dynamic gate learning may provide valuable insights for efficient vision-language approaches in the future.

Conclusion and Future Work
In this paper, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction. Specifically, we present visual prefix-guided fusion, which concatenates object-level visual representations as the prefix of each self-attention layer in BERT and serves as a softer and more robust attention module for visual-enhanced NER and RE. We further design a dynamic gated aggregation strategy to leverage hierarchical multi-scaled visual representations as visual guidance for fusion. Intuitively, good visual guidance makes a better extractor, and extensive experimental results on three benchmarks have demonstrated the effectiveness and robustness of our proposed method. Meanwhile, our method also faces the limitation that it is not suitable for multimodal tasks on the visual side, such as visual grounding.
In the future, we plan to 1) explore more applications of hierarchical visual prefix in multimodal representation learning, making it more flexible and extensible; 2) try to apply the reverse version of our approach to boost visual representation with text for CV; 3) extend our approach to multitask multimodal pre-training.

B Experimental Details
This section details the training procedures and hyperparameters for each of the datasets. We use the BERT-base-uncased model from the Hugging Face library. We follow UMGF in revising some wrong annotations in the Twitter-2015 dataset. Considering the instability of few-shot learning, we run each experiment 5 times with the random seeds [1, 49, 1234, 2021, 4321] and report the averaged performance. We conduct experiments with PyTorch on one Nvidia 3090 GPU. All optimizations are performed with the AdamW optimizer, with a linear warmup of the learning rate over the first 10% of gradient updates to a maximum value, followed by linear decay over the remainder of training. Weight decay of 0.01 is applied to all non-bias parameters. We set the number of image objects m to 3. We describe the training hyperparameters in the following sections.
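The warmup-then-decay schedule described above can be sketched as a learning-rate multiplier (an illustrative implementation, not the released training code):

```python
import torch

def warmup_linear_decay(step, total_steps, warmup_frac=0.1):
    """LR multiplier: linear warmup over the first 10% of updates, then linear decay."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup))

# usage sketch: AdamW with weight decay 0.01 (bias parameters would be excluded)
params = [torch.nn.Parameter(torch.zeros(2, 2))]
opt = torch.optim.AdamW(params, lr=3e-5, weight_decay=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: warmup_linear_decay(s, 1000))
```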

B.1 Standard Supervised Setting
In the MNER task, we fix the batch size at 8 and search for learning rates in the interval [1e-5, 3e-5]. We train the model for 30 epochs and evaluate after the 16th epoch. In the MRE task, we fix the batch size at 32 and the learning rate at 1e-5. We train the model for 12 epochs and evaluate after the 8th epoch. For both tasks, we choose the model that performs best on the validation set and evaluate it on the test set.

B.2 Low-Resource Setting
For different numbers of instances per class, we sample five times with the random seeds [1, 2, 49, 4321, 1234] and report the averaged performance. For all models, we fix the batch size at 8 and search for learning rates in the interval [3e-5, 5e-5]. We train the model for 30 epochs and evaluate after the 16th epoch. We choose the model that performs best on the validation set and evaluate it on the test set.

B.3 Cross-Task Setting
For both the MNER and MRE tasks, we use ResNet and BERT-base as backbones, and we transfer all parameters except the classifier layer and the CRF layer when performing cross-task training. In further training, we fix the batch size at 8 and search for learning rates in the interval [1e-5, 3e-5]. We train the model for 12 epochs and evaluate after the 8th epoch. We choose the model that performs best on the validation set and evaluate it on the test set.