Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction

Multi-Modal Relation Extraction (MMRE) aims to identify the relation between two entities in texts that contain visual clues. Rich visual content is valuable for the MMRE task, but existing works cannot well model the finer associations among different modalities, failing to capture the truly helpful visual information and thus limiting relation extraction performance. In this paper, we propose a novel MMRE framework, termed DGF-PT, to better capture the deeper correlations among text, entity pair, and image/objects, so as to mine more helpful information for the task. We first propose a prompt-based autoregressive encoder, which builds the associations of intra-modal and inter-modal features related to the task via entity-oriented and object-oriented prefixes, respectively. To better integrate helpful visual information, we design a dual-gated fusion module to distinguish the importance of images/objects and further enrich the text representations. In addition, a generative decoder is introduced with an entity-type restriction on relations, better filtering out candidates. Extensive experiments conducted on the benchmark dataset show that our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.


Introduction
As a fundamental subtask of information extraction, relation extraction (RE) aims to identify the relation between two entities (Cong et al., 2022; Xue et al., 2022). Recently, there has been a growing trend toward multi-modal relation extraction (MMRE), which classifies the textual relation between two entities while introducing visual content. It provides additional visual knowledge that incorporates multi-media information to support various cross-modal tasks such as multi-modal knowledge graph construction (Zhu et al., 2022; Wang et al., 2019) and visual question answering systems (Wang et al., 2022; Shih et al., 2016). Existing methods achieved considerable success by leveraging visual information (Zheng et al., 2021a; He et al., 2022; Chen et al., 2022), since the visual contents provide valuable pieces of evidence to supplement the missing semantics for MMRE. Previous work (Zheng et al., 2021a) introduced the visual relations of objects in the image to enrich text embeddings via an attention-based mechanism. Next, HVPNet (Chen et al., 2022) used an object-level prefix and a multi-scale visual fusion mechanism to guide text representation learning. Nevertheless, these methods primarily focus on the relations between objects and text and ignore the finer associations (entity pair, text, and image/objects). Furthermore, they usually fail to identify the truly helpful parts/objects of the image for the corresponding entity pair because they introduce all the objects. This may cause severe performance degradation in downstream tasks.
For multi-modal relation extraction, not all images or their objects are helpful for prediction. As illustrated in Figure 1, given three different inputs with the same relation Member_of and the same entity pair, each input contains a text, an image, and an entity pair. There are two situations: (a) The image is helpful for relation extraction. For the entity pair LeBron James and Lakers, the image of LeBron James wearing the Lakers jersey reveals the implied relationship between the two entities. Therefore, we can improve relation extraction by considering the entity-pair relationships in the visual information. (b) The image is unhelpful for the entity pair LeBron James and Lakers, since it only contains a Lakers object rather than association information for the entity pair. Furthermore, the image can provide an incorrect extraction signal: for example, the third image in Figure 1 shows that the relation between LeBron James and Lakers is more likely to be misjudged as Coach_of or Owner_of. Unhelpful visual content is prone to providing misleading information when predicting the relation. In general, it is necessary to identify the truly helpful visual information and filter out the useless and misleading parts, but this remains under-explored.
To overcome the above challenges, we propose a novel MMRE framework, DGF-PT, to better incorporate finer granularity in the relations and avoid unhelpful images misleading the model. Specifically, we propose a prompt-based autoregressive encoder containing two types of prefix-tuning to integrate deeper associations. It makes the model focus on intra-modal associations (between entity pair and text) via the entity-oriented prefix and on inter-modal associations (between objects and text) via the object-oriented prefix. To distinguish the importance of the image/objects, we design a dual-gated fusion module that addresses unhelpful visual data by utilizing interaction information via local and global visual gates. Later, we design a generative decoder to leverage the implicit associations and restrict candidate relations by introducing entity types. We further design a joint objective that keeps the distributions of representations pre- and post-fusion consistent while enhancing the model's ability to identify each sample in the latent space. Experimental results show that our approach achieves excellent performance on the benchmark dataset. Our contributions can be summarized as follows.
• We technically design a novel MMRE framework to build deeper correlations among entity pair, text, and image/objects and to distinguish helpful visual information.
• We propose a prompt-based autoregressive encoder with two types of prefixes to enforce the intra-modal and inter-modal associations. We design dual-gated fusion with a local object-importance gate and a global image-relevance gate to integrate helpful visual information.
• Experimental results indicate that the framework achieves state-of-the-art performance on the public multi-modal relation extraction dataset, even in the few-shot situation.

Related Work
The multi-modal relation extraction (MMRE) task, a subtask of multi-modal information extraction in NLP (Sun et al., 2021; Cong et al., 2020; Sui et al., 2021; Lu et al., 2022), aims to identify the textual relation between two entities in a sentence by introducing visual content (Zheng et al., 2021a,b; Chen et al., 2022), which compensates for insufficient semantics and helps to extract the relations.
Recently, several works (Zheng et al., 2021a,b; Chen et al., 2022) have begun to focus on multi-modal relation extraction. As the first work, MNRE (Zheng et al., 2021b) developed a multi-modal relation extraction baseline model. It demonstrates that introducing multi-modal information supplements the missing semantics and improves relation extraction performance on social media texts. Later, Zheng et al. (2021a) proposed a multi-modal neural network containing a scene graph that models the visual relations of objects and aligns the relations between objects and text via similarity and attention mechanisms. HVPNet (Chen et al., 2022) designs a visual prefix-guided fusion for introducing object-level visual information and further utilizes hierarchical multi-scale visual features. However, these methods introduce the information of all objects and thus cannot distinguish the truly helpful visual information, making it impractical to personalize the use of the image and further damaging their performance.
In the multi-modal relation extraction task, we note that images are naturally helpful information. However, the potential for a differentiated use of image information in this task is under-explored. In this paper, we focus on the finer (intra-modal and inter-modal) associations and manage to integrate truly useful visual information, promoting the exploitation of limited images. This enables bridging the gap by transferring the multi-modal relation extraction task into the MLM pre-training mechanism (Devlin et al., 2019; Liu et al., 2021).

Problem Formulation
We provide the definition of MMRE. For a given sentence T = {w_1, w_2, ..., w_L} with L words, an image I related to the sentence, and an entity pair (e_1, e_2), an MMRE model takes (e_1, e_2, T, I) as input and calculates a confidence score p(r_i | e_1, e_2, T, I) for each relation r_i ∈ R to estimate whether T and I reflect the relation r_i between e_1 and e_2. R = {r_1, ..., r_C, None} is a pre-defined relation set, where "None" means that no relation holds between the mentions.
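The problem formulation above can be sketched as a small scoring interface. This is a toy stand-in, not the DGF-PT model: the scoring function here is hypothetical, whereas the real model computes p(r | e_1, e_2, T, I) from fused textual and visual representations.

```python
# Toy subset of the pre-defined relation set R; names follow the paper's example.
RELATIONS = ["Member_of", "Coach_of", "Owner_of", "None"]

def predict_relation(e1, e2, text, image, score_fn):
    """Return the relation r maximizing the confidence score p(r | e1, e2, T, I)."""
    scores = {r: score_fn(e1, e2, text, image, r) for r in RELATIONS}
    return max(scores, key=scores.get)

def toy_score(e1, e2, text, image, r):
    # Hypothetical scorer: prefers "Member_of" when both entity mentions occur in the text.
    both_present = e1 in text and e2 in text
    return 1.0 if (r == "Member_of" and both_present) else 0.1

pred = predict_relation("LeBron James", "Lakers",
                        "LeBron James wears the Lakers jersey", None, toy_score)
print(pred)  # → Member_of
```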

Framework
This section introduces our proposed DGF-PT framework, as shown in Figure 2. We first design the prompt-based autoregressive encoder to acquire fine-grained representations (entity pair, objects, and text); it contains two types of prefixes for integrating helpful information and characterizing intra-modal and inter-modal interactions. To avoid unhelpful visual information misleading the model, we design the dual-gated fusion module to distinguish the importance of the image/objects via local and global gates. It also integrates the semantics of the image as transferred by Oscar (Li et al., 2020), and outputs an enhanced representation. Later, a generative decoder is proposed for relation prediction, leveraging the implicit associations and restricting candidate relations by introducing entity types. Finally, we design a joint objective including a distribution-consistency constraint, self-identification enhancement, and relation classification for model optimization.

Prompt-based Autoregressive Encoder
To acquire the finer granularity in the associations (entity pair, objects, and text), we propose a prompt-based autoregressive encoder. After initialization, two specific prefix-tuning strategies guide the encoder to attend to task-relevant inter-/intra-modal associations. Subsequently, the prefixes, objects, image, and text are progressively fed into the autoregressive encoder stage by stage to obtain fine-grained representations for the subsequent fusion module (Section 4.2).

Initialization
Given the text T, the word embeddings w ∈ R^{1×N} are obtained through the GPT-2 model (Radford et al., 2019) and then fed into a fully-connected layer, where N is the word dimension. The initial text representation is T = [w_1; w_2; ...; w_L] ∈ R^{L×N}.
Given an image I, the global image feature I ∈ R^{M×N} is obtained by VGG16 (Simonyan and Zisserman, 2015) with a fully-connected layer, transferring the feature into M blocks of N-dimensional vectors. We then extract object features using Faster R-CNN (Ren et al., 2015) and select the top-K objects by ROI classification score. Each object feature is obtained by average pooling over its ROI regions. The initial object representation is O = [o_1; o_2; ...; o_K] ∈ R^{K×N}.

Object & Entity Oriented Prefixes
Utilizing the advantages of prefix-tuning, a pre-trained encoder (e.g., GPT-2, our encoder) can be guided to learn task-specific features for fast adaptation to the MMRE task (Liu et al., 2021). However, the design of appropriate prefixes for learning finer associations in the MMRE task remains an open research question, and directly using prefixes from other tasks is not reasonable. Therefore, we construct two types of prefixes, an object-oriented prefix for inter-modal relevance (objects & text) and an entity-oriented prefix for intra-modal correlations (entity pair & text), encouraging the encoder to leverage text as a medium to strengthen multi-granular associations and acquire enhanced semantic representations.
Object-Oriented Prefix. Given that objects related to the entities are indeed useful information for the MMRE task, we propose an object-oriented prefix, termed P_o(·), which provides guidance on inter-modal relevance to the encoder. For the input text T, we define the pattern "Consider ⟨objects⟩, predict relation.", where ⟨objects⟩ denotes the objects relevant to the entity pair of T, which differ for each input. It emphasizes specific key textual contents and introduces the visual features of relevant objects.
Entity-Oriented Prefix. Since the visual information may be incomplete or misleading, we argue that an object-oriented prefix alone is insufficient to capture classification information. Thus, we propose an entity-oriented prefix, termed P_e(·), to capture the intra-modal association and adapt to the task. We define the pattern "Consider ⟨e_1, e_2⟩, predict relation.", where ⟨e_1, e_2⟩ is the entity pair whose relation is to be predicted.
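The two prefix patterns quoted above can be sketched as template functions. The template strings follow the paper; everything else (plain `<...>` brackets, function names) is illustrative, and turning the rendered strings into trainable prefix tokens prepended to each GPT-2 layer is omitted.

```python
def object_oriented_prefix(objects):
    # P_o: guides the encoder toward inter-modal relevance (objects & text).
    return f"Consider <{', '.join(objects)}>, predict relation."

def entity_oriented_prefix(e1, e2):
    # P_e: guides the encoder toward intra-modal association (entity pair & text).
    return f"Consider <{e1}, {e2}>, predict relation."

print(object_oriented_prefix(["person", "jersey"]))
# → Consider <person, jersey>, predict relation.
print(entity_oriented_prefix("LeBron James", "Lakers"))
# → Consider <LeBron James, Lakers>, predict relation.
```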

Multi-Stage Autoregressive Encoder
Prompt-based learning keeps the parameters of the whole PLM frozen and prepends the prefixes before the task inputs (Liu et al., 2021). A bidirectional encoder (e.g., BERT) cannot effectively integrate the proposed dual-gated fusion module (Section 4.2) at test time. Therefore, we deploy a unidirectional encoder (e.g., GPT or GPT-2) and design multiple stages to integrate multi-granular textual and visual knowledge, where the prefixes, objects, image, and text are fed in stage by stage.
First stage (S_1). The input of the first stage S_1 contains the prefixes and the objects, to learn relevance at the local granularity and obtain the representations of the objects. To introduce task-related prefix knowledge, the two types of trainable prefixes P_o(·) and P_e(·) are prepended before the input sequence as prefix tokens, obtained through the GPT-2 vocabulary. In S_1, the encoder learns the representations of the objects and updates the prefix tokens of each model layer (Eq. 1), where P*_o(·) and P*_e(·) are the updated prefixes after S_1, h_o is the representation of the objects, and o_{e_1} and o_{e_2} are the initial embeddings of entities e_1 and e_2. After the S_1 stage, the object information is introduced into the prefix embeddings.

Second stage (S_2). The inputs of the second stage S_2 are the outputs of the first stage S_1 (including the updated prefixes and the representations of the objects) and the image feature I, yielding the image representation h_i. We want the model to capture inter-modal relevance at the global granularity, which provides useful information for relation extraction and may improve performance. Thus, we introduce the image information in S_2, and the S_2 embedding is updated accordingly (Eq. 2).

Third stage (S_3). To learn the text representation h_t, the third stage takes the outputs of S_2 and the text T, interacting with the objects and the image (Eq. 3).
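The staged input order above can be illustrated with a toy sketch (not the actual model code): under a unidirectional encoder, each stage's inputs are appended after the previous stage's, so later inputs can attend to everything fed in earlier. All names here are illustrative placeholders.

```python
def build_stage_inputs(prefix_tokens, object_feats, image_feats, text_tokens):
    """Accumulate the encoder's input sequence stage by stage (S1 -> S2 -> S3)."""
    s1 = prefix_tokens + object_feats   # S1: prefixes P_o, P_e and object features
    s2 = s1 + image_feats               # S2: append the global image feature
    s3 = s2 + text_tokens               # S3: append the text, which attends to all of the above
    return s1, s2, s3

s1, s2, s3 = build_stage_inputs(["<P_o>", "<P_e>"], ["obj1", "obj2"], ["img"], ["w1", "w2"])
print(s3)  # → ['<P_o>', '<P_e>', 'obj1', 'obj2', 'img', 'w1', 'w2']
```

Under the causal attention mask, the text tokens at the end of `s3` can attend to the prefixes, objects, and image, which is why the text representation h_t absorbs the visual context.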

Dual-Gated Fusion
Unhelpful or task-irrelevant information in the image is often overlooked when all objects are simply aggregated. To solve this, we propose dual-gated fusion to effectively integrate helpful visual information while filtering out misleading information. This module uses local and global gates to distinguish the importance and relevance of the image/objects and to filter out task-irrelevant parts. By integrating the semantic information of the image, it produces a final fused representation containing the associations among image, objects, and text. Specifically, the local object-importance gate vector β, computed from the local object features, and the global image-relevance gate vector γ, computed from the global image features, are calculated as: where FC is a fully-connected layer and h_o[k] is the k-th object in the object set O; h_o computes attention between the selected top-K objects across modalities. Subsequently, the fused textual feature h̃_t is calculated by: where MLP is a multilayer perceptron and ⊙ denotes the Hadamard product.
To further integrate the semantics of the visual information, we use Oscar (Li et al., 2020) to transfer h_i into a text description h_{i2t} for each image, using objects as anchor points to align visual and textual features in a common space. It learns multi-modal alignment information of entities from a semantic perspective. Details are given in Appendix A.
While local representations can capture valuable clues, global features provide condensed contextual and high-level semantic information. Given this insight, we leverage the global information from one modality to regulate the local fragments of the other modality, enabling the entity representation to carry semantic information while filtering out irrelevant visual information. The final fused representation is: where δ is the trade-off factor between the text embedding h̃_t and the inter-modal text representation h_{i2t}.
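As a minimal numeric sketch of the dual gating, the following reduces every feature to a scalar (the real module operates on vectors with FC/MLP layers, whose exact equations are not reproduced here): β is a softmax-style local object-importance gate, γ is a sigmoid-style global image-relevance gate, and δ trades off the fused text feature against the Oscar-style caption representation h_{i2t}. All functional forms are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dual_gated_fusion(h_t, objects, h_i, h_i2t, delta=0.4):
    """Scalar toy version of dual-gated fusion (illustrative only)."""
    beta = softmax([o * h_t for o in objects])       # local object-importance gate
    gamma = 1.0 / (1.0 + math.exp(-h_i * h_t))       # global image-relevance gate
    visual = gamma * sum(b * o for b, o in zip(beta, objects))
    h_t_fused = h_t + visual                         # text enriched with gated visual signal
    return delta * h_t_fused + (1.0 - delta) * h_i2t # delta-weighted final representation
```

With δ = 0 the output collapses to the caption-side representation h_{i2t}; with δ = 1 only the gated text-visual fusion survives, mirroring the trade-off role of δ in the text.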

Generative Decoder
To leverage the implicit associations and restrict candidate relations by introducing entity types, we design a generative decoder. The type of the entity pair is helpful for relation classification. For example, the relation between entity types Person and Organization cannot be born or friend, but may be CEO or staff. Thus, we introduce the head type T^t_{e_1} and the tail type T^t_{e_2} one by one to leverage the implicit associations and restrict the candidate relations.
To maintain the consistency of the relation extraction task with the MLM pre-trained model, we use the generative decoder to predict the relation. The prediction of the generative decoder is: where h^t_{e_1} and h^t_{e_2} are the representations of the types, and r is the representation of the relation.
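The entity-type restriction can be sketched as a hypothetical compatibility table. The table entries below are illustrative (built from the paper's Person/Organization example), and the real decoder conditions generation on the head/tail types rather than hard-filtering a list.

```python
# Illustrative relation inventory and type-compatibility table (assumptions).
ALL_RELATIONS = ["Member_of", "CEO_of", "Staff_of", "friend", "born", "None"]

TYPE_COMPATIBLE = {
    ("Person", "Organization"): {"Member_of", "CEO_of", "Staff_of", "None"},
    ("Person", "Person"): {"friend", "born", "None"},
}

def candidate_relations(head_type, tail_type):
    """Restrict the candidate relation set by the types of the entity pair."""
    allowed = TYPE_COMPATIBLE.get((head_type, tail_type), set(ALL_RELATIONS))
    return [r for r in ALL_RELATIONS if r in allowed]

print(candidate_relations("Person", "Organization"))
# → ['Member_of', 'CEO_of', 'Staff_of', 'None']
```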

Joint Objective
To address distribution consistency within the dual-gated fusion module, we introduce a distribution-consistency constraint loss, applied on a single-sample basis. Additionally, to meet the need for inter-sample identification, we propose a self-identification enhancement loss. The overall joint objective then combines the relation classification loss with the aforementioned constraints.
Distribution-Consistency Constraint. To ensure the dual-gated fusion module effectively integrates helpful visual features while avoiding the introduction of task-irrelevant information, we introduce a distribution-consistency constraint to measure and optimize the change in the representation distribution pre- and post-fusion. We use the KL divergence to measure the distance between the probability distributions of h̃_t and h_t, which is equivalent to calculating the cross-entropy loss over the two distributions:

Self-Identification Enhancement. The MMRE task requires the model to correctly classify relations from individual samples. However, relation labels are unevenly distributed or lacking in the real world, so further enhancement is needed. We design a negative-sampling-based self-supervised loss function to enhance the model, treating the dual-gated fusion module as the augmentation function leveraging the modality information. Specifically, the textual representation h_t and the fused representation h̃_t are mutually positive samples: where r is the relation between the head entity e_1 and the tail entity e_2 for h_x, and h_t[e_1], h_t[e_2] are the representations of the two entities. Finally, the overall loss function of our model is as follows.
where λ_d, λ_s, and λ_c are trade-off parameters. We optimize all training inputs in a mini-batch strategy.
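The joint objective can be sketched as a weighted sum of the three losses. The assignment of λ_d, λ_s, and λ_c to the distribution-consistency, self-identification, and classification terms respectively is our assumption (the full equation is not reproduced in the text), and the KL helper below is the standard discrete form used by the distribution-consistency constraint.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) between two probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(loss_cls, loss_dc, loss_si, lam_d=2.0, lam_s=2.0, lam_c=3.0):
    # Assumed weighting: each lambda scales its corresponding loss term.
    # Default values follow the best coefficients reported in Section 5 (2, 2, 3).
    return lam_c * loss_cls + lam_d * loss_dc + lam_s * loss_si
```

The distribution-consistency term would be `kl_divergence(softmax(h_t_fused), softmax(h_t))` on the pre- and post-fusion distributions; it vanishes exactly when fusion leaves the distribution unchanged.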

Dataset and Evaluation Metric
We conduct experiments on the multi-modal relation extraction dataset MNRE (Zheng et al., 2021b). The compared baselines include: ... (Tang et al., 2020) to learn the connection between the text and the objects of the image. (5) BERT+SG+Att adopts an attention mechanism to compute the relevance between the textual and visual features. (6) Visual-BERT (Li et al., 2019) is a single-stream encoder that learns cross-modal correlations within one model. (7) MEGA (Zheng et al., 2021a) considers relevance from the perspectives of object structure in the image and text semantics, with graph alignment. (8) HVPNet (Chen et al., 2022) introduces an object-level prefix with a dynamic gated aggregation strategy to enhance the correlation between all objects and the text.
In contrast to these methods, our approach incorporates the correlation between entity pairs, text, and visual information, and effectively identifies useful visual information.

Implementation Details
For all baselines, we adopt the best hyper-parameters and copy the results reported in the literature (Zheng et al., 2021a,b; Chen et al., 2022).
We used PyTorch as the deep learning framework to develop the MMRE model. BERT and GPT-2 are used for text initialization, with the dimension set to 768. The VGG version is VGG16. We use Faster R-CNN (Ren et al., 2015) for image initialization and set the dimension of the visual object features to 4096. For hyper-parameters, the best coefficients λ_d, λ_s, and λ_c are 2, 2, and 3, and the best δ is 0.4. See Appendix B for more details on model training.

Main Results
To verify the effectiveness of our model, we report the overall average results in Table 1.
From the table, we can observe that: 1) Our model outperforms text-based RE models on all four evaluation metrics, indicating the beneficial impact of visual information on relation extraction and the necessity of integrating it. 2) Compared to the MMRE baselines, our model achieves the best results; specifically, it improves by at least 2.62% in F1 and 8.10% in Acc. These results indicate that our way of incorporating and utilizing visual information is superior and effective. 3) Among different encoders (e.g., BERT and GPT), GPT and GPT-2 achieve better results, demonstrating that a generative encoder can integrate effective visual features more effectively and is thus more suitable for the task. For the generative model, performance is sensitive to the input order; we therefore discuss the effect of the order of text, image, and objects in Appendix C.

Discussion for Model Variants
For a further detailed evaluation of the components of our framework, we performed ablation experiments and report the results in Table 2, where E-P means the entity-oriented prefix, O-P means the object-oriented prefix, and "↓" denotes the average decrease over all four metrics compared to our full model.
Discussions for core modules. To investigate the effectiveness of each module, we performed variant experiments, with results in Table 2. From the table, we can observe that: 1) The impact of the prefixes tends to be more significant. We believe the reason is that the multiple prompts characterize modality interactions, helping to provide more visual clues. 2) Removing each module in turn generally decreases performance. Compared to the joint-objective modules, the dual-gated fusion has a more significant effect, demonstrating the effectiveness of knowledge fusion in introducing useful visual content and addressing noisy visual data. All observations demonstrate the effectiveness of each component in our model.

Discussions for the stage of prefix. We explore the effects of introducing the prefixes at different stages of the encoder, as shown in Table 2. From the table, we can observe that: 1) Compared to feeding all prefixes in the S_2 or S_3 stage, the S_1 stage is more effective, demonstrating that introducing prefixes early may integrate more helpful visual classification information. 2) When the O-P is fixed and the E-P is fed into the S_3 stage, our model performs best compared to introducing it in S_2, demonstrating that placing the E-P near the text features helps introduce the intra-modal association. 3) When we fix the E-P and introduce the O-P in the S_2 stage, near the objects, the performance is better than in S_3, demonstrating that placing the O-P near the object features captures more useful local information for relation classification. 4) When we vary the stages of the two prefixes, performance is better with the E-P in S_3 and the O-P in S_2. All observations demonstrate that "E-P in S_1 & O-P in S_1" is the best schema to introduce the intra-modal association and inter-modal relevance.

Discussions for Image Information
To further investigate the impact of images on all compared methods, we report results after deleting different proportions of images, as shown in Figure 3. From the figure, we can observe that on all metrics, the larger the proportion of retained images, the better the performance. These observations indicate that our model incorporates visual features more effectively.

Discussions for Sample Number
We investigate the impact of the sample number of different relations. To do so, we divide the dataset into multiple blocks based on the sample number of each relation and evaluate the performance by varying the sample number of relations in [0, 1000], compared with the outstanding baselines, as shown in Figure 4. From the figure, we can observe that: 1) Increasing the sample number brings performance improvements to all methods. The main reason is that the smaller the sample number, the more difficult it is to distinguish the relation. 2) Our model still outperforms the baseline methods as the sample number decreases, demonstrating the superiority of our method in tackling relations with fewer samples. This phenomenon confirms that the prefixes are suitable for few-shot situations. All the observations demonstrate that our method reduces the impact of the sample number.

Case Study
To illustrate that our model can effectively identify useful visual information, we provide an example involving various entity pairs. As shown in Figure 5, the helpful information varies depending on the entity pair. From the figure, we can observe that: 1) Our model achieves superior performance across different entity pairs, demonstrating its ability to effectively extract useful visual information while avoiding the negative influence of unhelpful information on prediction. 2) When presented with the entity pair Vera and P.Wilson, which contains limited useful visual information, our model remains the best, while other baselines make incorrect predictions. These observations further demonstrate the effectiveness of our model in leveraging visual information while avoiding the negative influence of unhelpful information on predictions.

Conclusion
We propose DGF-PT, a novel multi-modal relation extraction framework, to capture deeper correlations among entity pair, text, and image/objects and to integrate more helpful information for relation extraction. Our framework effectively integrates intra-modal and inter-modal features, distinguishes helpful visual information, and restricts candidate relations. Extensive experiments conducted on the benchmark dataset show that our approach achieves excellent performance.

Limitations
Our work overcomes visual noise data that limit extraction performance, incorporating multi-modal knowledge at different levels. Empirical experiments demonstrate that our method prevents noisy data from misleading the MMRE model. However, some limitations of our approach remain, summarized as follows: • Due to the limitations of existing MMRE datasets, we experiment on only two modalities to explore the influence of image features. We will study more modalities in future work.
• Our method neglects multiple relations for one input, and thus may not consider the multiple semantics of entities. We leave multiple-relation extraction for future work.

Ethics Statement
In this work, we propose a new MMRE framework that captures deeper correlations and fuses helpful visual information to benchmark our architecture with baseline architectures on the MNRE dataset.
Data Bias. Our framework is designed for multi-modal relation extraction on Twitter data. However, when applied to data with vastly different distributions or to new domains, the model's performance may be biased. The results reported in the experiment section are based on specific benchmark datasets and may be affected by these biases. Therefore, caution should be taken when evaluating generalizability and fairness.
Computing Cost/Emission. Our research, which entails the utilization of large language models, necessitates a significant computational burden. We recognize that this computational burden results in a negative environmental impact in terms of carbon emissions. Specifically, our work required a cumulative 425 GPU hours of computation on Tesla V100 GPUs. The total emissions generated by this computational process are estimated at 47.18 kg of CO2 per run, with a total of two runs performed.

A Oscar for Image Caption Generation
To generate a text description of the image for multi-modal knowledge alignment without additional pre-training on multi-modal relation extraction, we directly utilize an image captioning method, which generates a natural language description of the content of an image. In this paper, we use Oscar (Object-Semantics Aligned Pre-training) (Li et al., 2020) to transfer each image into a text description, which integrates multi-modal alignment information of entities from a semantic perspective.
Oscar uses object tags detected in images as anchor points to significantly facilitate alignment learning. Input samples are processed into triples of image region features, captions, and object tags, similar to pre-training. It randomly masks 15% of the caption tokens and uses the corresponding output representations to perform classification to predict the tokens. Similar to VLP (Zhou et al., 2020), the self-attention mask is constrained so that a caption token can only attend to the tokens before its position, simulating a uni-directional generation process. This eases the learning of semantic alignments between images and texts on a public corpus of 6.5 million text-image pairs, achieving a new state of the art on the image captioning task. Thus, we use Oscar to integrate useful images by transferring them into textual descriptions.

B Hyper-parameter Settings
Our implementation is based on PyTorch. All experiments were carried out on a server with one GPU (Tesla V100). For re-implementation, we report our hyper-parameter settings on the dataset in Table 3. Note that the hyper-parameter settings were tuned on the validation data by grid search with 5 trials. The learning rate is 2e-4, the batch size is 100, and the dropout rate is 0.6. We use AdamW (Loshchilov and Hutter, 2019) to optimize the parameters. The maximum length of the text is 128, and the number of objects per image is 10. For the learning rate, we adopt grid search with a step size of 0.0001.
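For convenience, the settings reported above and in Section 5 can be collected into a single configuration dict; the key names are our own (illustrative), while the values are those reported in the paper.

```python
# Hyper-parameter settings from Appendix B and Section 5 (key names illustrative).
CONFIG = {
    "learning_rate": 2e-4,
    "batch_size": 100,
    "dropout": 0.6,
    "optimizer": "AdamW",
    "max_text_length": 128,
    "objects_per_image": 10,     # top-K objects kept per image
    "text_dim": 768,             # BERT / GPT-2 hidden size
    "object_feat_dim": 4096,     # Faster R-CNN visual object features
    "lambda_d": 2,               # distribution-consistency trade-off
    "lambda_s": 2,               # self-identification trade-off
    "lambda_c": 3,               # relation-classification trade-off
    "delta": 0.4,                # fusion trade-off factor
}
```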

C Discussions for Input Order
Due to the use of a generative encoder, where the prefix, objects, image, and text are input stage by stage, the input order affects the performance of the model.

Figure 1: An example of the MMRE task. The task is to predict the relation of a given entity pair for the specific text and image, which contains multiple objects.
where {x, x̃} are {h_t, h̃_t}, [a]_+ = max(a, 0), and s(·, ·) is the cosine similarity. x_n and x̃_n are the hardest negatives of h_t and h̃_t in a mini-batch, based on a similarity measure. Relation Classification. The loss for relation classification, given by the negative log-likelihood function, is as follows:

Figure 4: Impact of differences in sample number.

Table 1: Main experiments. The best results are highlighted in bold, "-" means the result is not available, and underlined values are the second-best results. "↑" denotes the increase compared to the underlined values.

Table 2: Variant experiments on different orders of introducing the two prefixes. "w/o" means removing the corresponding module from the complete model; "repl." means replacing the stage of introducing a prefix.

repl. … & O-P in S_3         84.09  83.43  82.81  84.15  ↓ 0.86
repl. E-P in S_3 & O-P in S_2  84.76  84.24  83.38  84.20  ↓ 0.33

Table 4: Impact of the input order of the image I_i, objects I_o, and text I_t.

As shown in Table 4, we exploit the best input order for multi-modal relation extraction. From the table, we can observe that: 1) Our model is affected by the input order of text, image, and objects. We attribute this to the prompt-based autoregressive encoder being a more efficient way to integrate multi-grained information. 2) The best input order is I_o → I_i → I_t. Furthermore, when the text I_t is input before the others, the performance of our model drops dramatically, demonstrating that feeding visual information before textual information usually integrates more helpful extraction knowledge.