Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling

Existing research on multimodal relation extraction (MRE) faces two co-existing challenges: internal-information over-utilization and external-information under-exploitation. To combat these, we propose a novel framework that simultaneously implements the ideas of internal-information screening and external-information exploiting. First, we represent the fine-grained semantic structures of the input image and text with visual and textual scene graphs, which are further fused into a unified cross-modal graph (CMG). Based on the CMG, we perform structure refinement under the guidance of the graph information bottleneck principle, actively denoising the less-informative features. Next, we perform topic modeling over the input image and text, incorporating latent multimodal topic features to enrich the contexts. On the benchmark MRE dataset, our system significantly outperforms the current best model. With further in-depth analyses, we reveal the great potential of our method for the MRE task.


Introduction
Relation extraction (RE), determining the semantic relation between a pair of subject and object entities in a given text (Yu et al., 2020), has played a vital role in many downstream natural language processing (NLP) applications, e.g., knowledge graph construction (Wang et al., 2019; Mondal et al., 2021) and question answering (Cao et al., 2022b). But in realistic scenarios (e.g., social media), data often comes in various forms and modalities (e.g., texts, images) rather than pure text. Thus, multimodal relation extraction (MRE) has been introduced recently (Zheng et al., 2021b), where additional visual sources complement the textual RE as an enhancement to relation inference. The essence of successful MRE lies in the effective utilization of multimodal information. Certain efforts have been made in existing MRE work and have achieved promising performance, where delicate interaction and fusion mechanisms are designed for encoding the multimodal features (Zheng et al., 2021a; Chen et al., 2022b,a). Nevertheless, current methods still fail to sufficiently harness the feature sources from two information perspectives, which may hinder further task development.
Internal-information over-utilization. On the one hand, most existing MRE methods progressively incorporate full-scale textual and visual sources into the learning, under the assumption that all the input information certainly contributes to the task. In fact, prior textual RE research extensively shows that only parts of the texts are useful for relation inference (Yu et al., 2020), and accordingly proposes to prune the input sentences (Zhang et al., 2018). The case is more severe for visual inputs, as the visual sources do not always play positive roles, especially on social media data. As revealed by Vempala and Preoţiuc-Pietro (2019), as much as 33.8% of the visual information provides no useful context, or even acts as noise, in MRE. Xu et al. (2022) thus propose to selectively remove images from the input image-text pairs. Unfortunately, such coarse-grained instance-level filtering largely hurts the utility of visual features. We argue that a fine-grained feature screening over both the internal image and text features is needed. Taking example #1 in Fig. 1, the textual expression 'Congratulations to Angela and Mark Salmons' and the visual objects 'gift' and 'roses' are valid clues to infer the 'couple' relation between 'Angela' and 'Mark Salmons', while the rest of the text and visual information is essentially task-irrelevant noise.
External-information under-exploitation. On the other hand, although the text inputs are compensated with visual sources, there can still be information deficiency in MRE, in particular when the visual features serve little (or even negative) utility. This is especially the case for social media data, where the contents are less informative due to short text lengths and low-relevance images (Baly et al., 2020). For example #2 in Fig. 1, due to the lack of necessary contextual information, it is tricky to infer the relation 'present in' between 'Hot summer' (an album name) and 'Migos' (a singer name) from the image and text sources alone. In this regard, more external information should be considered and exploited for MRE. Fortunately, the topic modeling technique offers a promising solution, as it has been shown to enrich the semantics of raw data and thus facilitate a broad range of NLP applications. For the same example, if an additional 'music' topic feature is leveraged into the context, the relation inference can be greatly eased.
Taking into account the above two observations, in this work we propose a novel framework to improve MRE. As shown in Fig. 4, we first employ scene graphs (SGs) (Johnson et al., 2015) to represent the input vision and text, as SGs intrinsically depict the fine-grained semantic structures of texts and images. We fuse the visual and textual SGs into a cross-modal graph (CMG) as our backbone structure. Next, we reach the first goal of internal-information screening by adjusting the CMG structure via the graph information bottleneck (GIB) principle (Wu et al., 2020), i.e., GIB-guided feature refinement, during which the less-informative features are filtered and the task-relevant structures are highlighted. Then, to realize the second goal of external-information exploiting, we perform multimodal topic integration. We devise a latent multimodal topic module to produce both textual and visual topic features based on the multimodal inputs. The multimodal topic keywords are integrated into the CMG to enrich the overall contexts, based on which we conduct the final relation reasoning for the input.
We perform experiments on the benchmark MRE dataset (Zheng et al., 2021a), where the results show that our framework significantly boosts the current state of the art. Further analyses demonstrate that the GIB-guided feature refinement helps in effective input denoising, and the latent multimodal topic module induces rich task-meaningful visual and textual topic features as extended contexts. We finally reveal that the idea of internal-information screening is especially important in scenarios of higher text-vision relevance, while external-information exploiting particularly works in the lower text-vision relevance case.
To sum up, this work contributes by introducing a novel idea of simultaneous information subtraction and addition for multimodal relation extraction. The internal-information over-utilization and external-information under-exploitation are two common co-existing issues in many multimodal applications, to which our method can be broadly applied without much effort.

Textual and Visual Scene Graph
There have been the visual scene graph (VSG) (Johnson et al., 2015) and textual scene graph (TSG) (Wang et al., 2018), where both of them include three types of nodes: object node, attribute node, and relationship node. All the nodes come with a specific label text, as illustrated in Fig. 2.
In an SG, object and attribute nodes are connected to other objects via pairwise relations. As SGs intrinsically describe the semantic structures of scene contexts for the given texts or images, they are widely utilized as external features integrated into downstream applications for enhancement, e.g., image retrieval (Johnson et al., 2015), image generation (Johnson et al., 2018), and image captioning (Yang et al., 2019). We also take advantage of these SG structures for better cross-modal semantic feature learning. Formally, we define a scene graph as G=(V, E), where V is the set of nodes and E is the set of edges.

Graph Information Bottleneck Principle
The information bottleneck (IB) principle (Alemi et al., 2017) is designed for information compression. Technically, IB learns a minimal feature Z to represent the raw input X that is sufficient to infer the task label Y. Further, the graph-based IB (GIB) has been introduced for graph data modeling (Wu et al., 2020), i.e., refining a raw graph G into an informative yet compact one G− by optimizing:

L_GIB = −I(G−, Y) + β I(G−, G) ,    (1)

where minimizing I(G−, G) reduces the mutual information between G and G− such that G− learns to be a minimal and compact version of G. −I(G−, Y) is the prediction objective, which encourages G− to be informative enough to predict the label Y. β is a Lagrangian coefficient. We will employ the GIB principle for internal-information screening.
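The trade-off in Eq. 1 can be sketched numerically. The snippet below is a toy illustration (ours, not the paper's implementation), assuming −I(G−, Y) is approximated by a label cross-entropy and I(G−, G) is upper-bounded by a KL term to a standard-normal prior; all function names are illustrative.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    return 0.5 * float(np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * np.log(sigma)))

def gib_loss(logits, label, mu, sigma, beta=0.01):
    # Cross-entropy over the label prediction: surrogate for -I(G-, Y)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    ce = -np.log(probs[label])
    # KL to the prior: surrogate upper bound on I(G-, G)
    kl = kl_to_standard_normal(mu, sigma)
    return float(ce + beta * kl)
```

A small β (the paper uses 0.01) keeps the compression term from overwhelming the prediction term.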

Latent Multimodal Topic Modeling
We introduce a latent multimodal topic (LAMO) model. Technically, we first represent the input text T with a bag-of-words (BoW) feature b_T, and represent the image I with a visual BoW (VBoW) feature b_I. The topic generative process is described as follows: • Draw a topic distribution θ ∼ N(µ, σ).
• For each word token w^T_i and visual token w^I_j: draw w^T_i ∼ p(w^T_i | θ, χ) and w^I_j ∼ p(w^I_j | θ, ψ), where µ and σ are the mean and variance vectors of the posterior probability p(θ|T, I). χ ∈ R^{K×U_T} and ψ ∈ R^{K×U_I} are the probability matrices of textual topic-words and visual topic-words, respectively. K is the pre-defined number of topics, and U_T and U_I are the textual and visual vocabulary sizes.
As depicted in Fig. 3, µ and σ are produced by a cross-modal feature encoder upon T and I (note that the visual topic words are visual objects). The topic distribution is yielded via θ=Softmax(µ + σ · ε), where ε ∼ N(0, I). Then, we reconstruct the inputs b_T and b_I based on θ:

b̂_T = Softmax(θ^⊤ χ) ,  b̂_I = Softmax(θ^⊤ ψ) .

Then, with the activated k-th topic (via argmax over θ), we obtain the distributions of the textual and visual topic words by slicing χ[k, :] ∈ R^{U_T} and ψ[k, :] ∈ R^{U_I}. As shown in Fig. 3, the objective of topic modeling is the negative evidence lower bound:

L_LAMO = −E_{q(θ|T,I)}[ log p(b_T, b_I | θ) ] + KL( q(θ|T, I) || p(θ) ) .    (4)

Appendix §A.4 extends the description of LAMO.
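The generative step above can be sketched with toy dimensions. In this sketch (ours), the encoder outputs µ/σ and the matrices χ/ψ are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, U_T, U_I = 10, 50, 30          # topics, text vocab size, visual vocab size

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

mu, sigma = rng.normal(size=K), np.ones(K)   # stand-in for the cross-modal encoder
chi = rng.random((K, U_T))                   # textual topic-word matrix
psi = rng.random((K, U_I))                   # visual topic-word matrix

eps = rng.standard_normal(K)
theta = softmax(mu + sigma * eps)            # topic distribution theta

k = int(np.argmax(theta))                    # activated topic
text_topic_words = softmax(chi[k, :])        # distribution from slicing chi[k, :]
vis_topic_words = softmax(psi[k, :])         # distribution from slicing psi[k, :]
top_text = np.argsort(-text_topic_words)[:5] # top-L textual keywords (L=5 here)
```

The reparameterized draw θ = Softmax(µ + σ · ε) keeps the sampling step differentiable for end-to-end training.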

MRE Framework
As shown in Fig. 4, our overall framework consists of five tiers. First, the model takes as input an image I and text T , as well as the subject v s and object entity v o . We represent I and T with the corresponding VSG and TSG. Then, the VSG and TSG are assembled as a cross-modal graph, which is further modeled via a graph encoder. Next, we perform GIB-guided feature refinement over the CMG for internal-information screening, which results in a structurally compact backbone graph. Afterwards, the multimodal topic features induced from the latent multimodal topic model are integrated into the previously obtained feature representation for external-information exploitation. Finally, the decoder predicts the relation label Y based on the enriched features.

Scene Graph Generation
We employ off-the-shelf parsers to generate the VSG (i.e., G_I=(V_I, E_I)) and TSG (i.e., G_T=(V_T, E_T)), respectively. We denote the representations of the VSG nodes as X_I={x^I_1, ···, x^I_n}, where each node embedding x^I_i is the concatenation of the object region representation and the corresponding node label embedding. We directly represent the TSG nodes as X_T={x^T_1, ···, x^T_m}, where each x^T_j is a contextualized word embedding. Note that both the visual object and text token representations are obtained from the CLIP (Radford et al., 2021) encoder, which ensures an identical embedding space across the two modalities.

Cross-modal Graph Construction
Next, we consider merging the VSG and TSG into one unified backbone cross-modal graph (CMG).
Formally, the CMG is defined as G=(V, E), where V=V_I ∪ V_T is the node set, and E is the set of edges, including the intra-modal edges (E_I and E_T) and the inter-modal hyper-edges E_×. To build the cross-modal hyper-edges between each pair of VSG node v^I_i and TSG node v^T_j, we measure the relevance score in between:

s_{i,j} = cos( x^I_i , x^T_j ) ,

and add a hyper-edge whenever s_{i,j} is larger than a pre-defined threshold λ. Node representations from the VSG and TSG are copied as the CMG's node representations, i.e., X=X_T ∪ X_I. We denote each edge e_{i,j} (∈ E)=1 if there is an edge between two nodes, and e_{i,j}=0 otherwise. Next, a graph attention model (GAT; Velickovic et al., 2018; Fei et al., 2022e,d) is used to fully propagate the CMG:

H = GAT(X, E) .
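The hyper-edge construction can be sketched as follows; cosine similarity over the CLIP embeddings is our stand-in for the relevance score, and the function name is illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_hyper_edges(X_I, X_T, lam=0.25):
    """Connect VSG node i to TSG node j whenever the relevance score
    s_ij exceeds the threshold lambda (0.25 in the paper's setting)."""
    edges = []
    for i, x_i in enumerate(X_I):
        for j, x_t in enumerate(X_T):
            if cosine(x_i, x_t) > lam:
                edges.append((i, j))
    return edges
```

Because both modalities are encoded by CLIP into one embedding space, a single threshold λ is meaningful across all node pairs.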

GIB-guided Feature Refinement
In this step, we propose a GIB-guided feature refinement (GENE) module to optimize the initial CMG structure such that we fine-grainedly prune the input image and text features. Specifically, with the GIB guidance, we 1) filter out task-irrelevant nodes, and 2) adjust the edges based on their relatedness to the task inference.
Node Filtering We assign a 0/1 value ρ_{v_i} to each node v_i, indicating whether v_i is kept, sampled from a Bernoulli distribution parameterized by π_{v_i}. While the sampling is a discrete process, we make it differentiable via the concrete relaxation method (Jang et al., 2017; Fei et al., 2022f):

ρ_{v_i} = Sigmoid( (log π_{v_i} − log(1 − π_{v_i}) + log u − log(1 − u)) / τ ) ,    (7)

where τ is the temperature and u ∼ Uniform(0, 1). We estimate π_{v_i} by considering both v_i's l-order context and the influence of the target entity pair:

π_{v_i} = Att( v_i , ϕ(v_i) , h_s , h_o ) ,    (8)

where Att(·) is an attention operation, ϕ(v_i) is the set of l-order neighbor nodes of v_i, and h_s and h_o are the representations of the subject and object entities.
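The relaxed sampling of Eq. 7 can be sketched as a Gumbel-sigmoid (binary concrete) draw; the exact parameterization in the paper may differ slightly from this illustration of ours:

```python
import numpy as np

def concrete_sample(pi, tau=0.1, rng=None):
    """Differentiable surrogate for rho ~ Bernoulli(pi): with a small
    temperature tau the samples concentrate near 0 or 1 while staying
    differentiable with respect to pi."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(pi))
    logits = np.log(pi) - np.log1p(-pi) + np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-logits / tau))
```

At τ = 0.1 (the paper's setting) most gates are already near-binary, so the refined graph behaves almost like a hard subgraph during training.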
Edge Adjusting Similarly, we take the same sampling operation (Eq. 7) to generate a signal ρ_{e_{i,j}} for each edge e_{i,j}, during which we also consider the l-order context features and the target entity pair:

π_{e_{i,j}} = Att( v_i , v_j , ϕ(v_i) , ϕ(v_j) , h_s , h_o ) ,    (9)

where ϕ(v_i) and ϕ(v_j) are the l-order neighbor nodes of v_i and v_j. Instead of directly determining the existence of e_{i,j} with ρ_{e_{i,j}} alone, we also take into account the existence of v_i and v_j, i.e., the effective edge gate is (ρ_{e_{i,j}} · ρ_{v_i} · ρ_{v_j}), because even if ρ_{e_{i,j}}=1, an edge is non-existent when its affiliated nodes are deleted.
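The combined gate ρ_e · ρ_vi · ρ_vj above can be written as one vectorized operation (a sketch of ours):

```python
import numpy as np

def effective_edge_gates(rho_e, rho_v):
    """rho_e: (n, n) edge gates; rho_v: (n,) node gates.
    An edge survives refinement only when its own gate and both
    endpoint gates are kept."""
    return rho_e * rho_v[:, None] * rho_v[None, :]
```

Deleting node i (rho_v[i] = 0) therefore zeroes every edge incident to i, even if that edge's own gate is 1.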
Thereafter, we obtain an adjusted CMG, i.e., G−, which is further updated via the GAT encoder, resulting in new node representations H−. We apply a pooling operation on H− to obtain the overall graph representation g, which is concatenated with the two entity representations as the context feature a:

a = [ g ; h_s ; h_o ] .    (10)

GIB Optimization To ensure that the above-adjusted graph G− is sufficiently informative (i.e., not wrongly pruned), we consider a GIB-guided optimization. We denote z as the compact representation of the resulting G−, which is sampled from a Gaussian distribution parameterized by a. Then, we rephrase the raw GIB objective (Eq. 1) as:

L_GIB = −I(z, Y) + β I(z, G) .    (11)

The first term −I(z, Y) can be expanded as:

−I(z, Y) ≤ −E_{p(Y,z)}[ log q(Y|z) ] ,    (12)

where q(Y|z) is a variational approximation of the true posterior p(Y|z). For the second term I(z, G), we estimate its upper bound via the reparameterization trick (Kingma and Welling, 2014; Fei et al., 2020c):

I(z, G) ≤ KL( p(z|G) || r(z) ) ,    (13)

where r(z) is a variational approximation of the prior p(z). We run GENE for several iterations for sufficient refinement. In Appendix §A.3 we detail the full technical process of GIB-guided feature refinement.

Multimodal Topic Integration
We further enrich the compressed CMG features with more semantic contexts, i.e., the multimodal topic features. As described in Sec. §2.3, our LAMO module takes as input the backbone CMG representation H and induces both the visual and textual topic keywords that are semantically relevant to the input content. Note that we only retrieve the associated top-L textual and visual keywords, separately. Technically, we devise an attention operation to integrate the embeddings of the multimodal topic words (u^T and u^I, from the CLIP encoder) into the resulting feature representation z of GENE:

α^{T/I}_i = Softmax( z^⊤ u^{T/I}_i ) ,  u^{T/I} = Σ_i α^{T/I}_i u^{T/I}_i .    (14)

We finally summarize these three representations as the final feature:

s = [ z ; u^T ; u^I ] .    (15)

The training of our overall framework is based on a warm-start strategy. First, GENE is trained via L_GIB (Eq. 11) to learn a sufficient multimodal fused representation of the CMG and refined features from the compacted CMG. Then the LAMO module is unsupervisedly pre-trained separately via L_LAMO (Eq. 4) on the well-learned multimodal fused representations so as to efficiently capture the task-related topics.
Once the two modules have converged, we train our overall framework with the final cross-entropy task loss L_CE(Ŷ, Y), together with the above two learning losses:

L = L_CE + L_GIB + L_LAMO .

Experiment

Setting
We experiment with the MRE dataset, which contains 9,201 text-image pairs and 15,485 entity pairs with 23 relation categories. The statistics of the MRE dataset are listed in Table 1.
Note that a sentence may contain several entity pairs, and thus a text-image pair can be divided into several instances, each with only one entity pair. We follow the same training, development, and testing split as set in Zheng et al. (2021a). We compare our method with baselines in two categories: 1) text-based RE methods that traditionally leverage merely the texts of the MRE data, e.g., BERT-based models; and 2) multimodal RE methods, e.g., MKGformer.

We use the pre-trained language-vision model CLIP (vit-base-patch32) to encode the visual and textual inputs. The threshold λ is set to 0.25; the temperature τ is 0.1; and β is set to 0.01. All the dimensions of node representations and GAT hidden sizes are set to 768-d. We utilize the 2-order (i.e., l = 2) context of each node to refine the nodes and edges of the CMG. Following existing MRE work, we adopt accuracy (Acc.), precision (Pre.), recall (Rec.), and F1 as the major evaluation metrics.

First of all, we see that both the GENE and LAMO modules have large impacts on the results, i.e., their removal leads to F1 drops of 1.91% and 1.94%, respectively. This confirms their fundamental contributions to the whole system. More specifically, the GIB guidance is key to the information refinement in GENE, while both the textual and visual topic features are key to LAMO. It is also critical to employ the SG for the structural modeling of the multimodal inputs, and the proposed cross-modal graph is likewise helpful to task modeling.

Analysis and Discussion
To gain a deeper understanding of how our proposed methods succeed, we conduct further analyses to answer the following questions.
RQ1: Does GENE help by really denoising the input features?
A: We first investigate the working mechanism of GENE for internal-information screening. We plot the trajectories of the node filtering and the edge adjusting, during which we show the changing trends of the overall performance and the mutual information I(G−, G) between the raw CMG (G) and the pruned one (G−, i.e., z). As shown in Fig. 5, along with the training process both the number of nodes and the number of edges decrease gradually, while the task performance climbs steadily as I(G−, G) declines. This clearly shows the efficacy of the task-specific information denoising by GENE.
RQ2: Are the LAMO-induced task-relevant topic features beneficial to the end task?
A: Now, we visualize the learned contextual features in our system, including z without the topic features integrated, and s with rich topic information injected. We separately project z and s onto the ground-truth relation labels of the MRE task, as shown in Fig. 6. We see that both z and s divide the feature space into several clear clusters, thanks to the GIB-guided information screening. However, there are still some wrongly-placed or entangled instances in z, largely due to the input feature deficiency. By supplementing more contexts with topic features, the patterns in s become much clearer, and the errors reduce. This indicates that LAMO induces topic features that are genuinely beneficial to the task. In Table 3 we show the top 10 latent topics with both the textual and visual keywords, where we notice that the latent topic information is precisely captured and modeled by LAMO. Further, we study the variance of the latent topics in the two modalities, exploring the different contributions of each type. Technically, we analyze the numbers of imported textual and visual topic keywords respectively, by observing the attention weights α^{T/I}_i (Eq. 14). In Fig. 7 we plot the distributions. It can be found that the model tends to make use of more textual contexts compared with the visual ones.
RQ3: How do GENE and LAMO collaborate to solve the end task?
A: As demonstrated previously, GENE is able to relieve the issue of noisy information, and LAMO can produce latent topics that offer additional clues for relation inference. Now we study how these two modules cooperate to reach the best results. First, we use the learned feature c* to calculate the task entropy − Σ_Y p(Y|c*) log p(Y|c*), where lower entropy means more confidence in the correct predictions. We compute the entropy using H (initial context feature), using z (denoised context feature), and using s (denoised and topic-enriched context feature), respectively, which represent the three stages of our system, as shown in Fig. 9. As seen, after the information denoising and enriching by GENE and LAMO respectively, the task entropy drops step by step, indicating an effective learning process with the two modules.
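The task entropy above is computed straightforwardly from the predicted label distribution (a sketch of ours):

```python
import numpy as np

def prediction_entropy(probs):
    """Task entropy -sum_Y p(Y|c) log p(Y|c) in nats; lower values
    indicate a more confident prediction."""
    p = np.asarray(probs, dtype=float)
    return float(-(p * np.log(p + 1e-12)).sum())
```

A uniform distribution over the dataset's 23 relations gives log 23 ≈ 3.14 nats, while a near-one-hot prediction approaches 0, matching the step-wise drop from H to z to s reported in Fig. 9.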
We further perform a case study to gain an intuitive understanding of how the two modules come into play. In Fig. 8 we illustrate two testing instances, where we visualize the constructed cross-modal graph structures, the refined graphs (G−), and the imported multimodal topic features. We see that GENE has fine-grainedly removed the noisy and redundant nodes, and adjusted the node connections to be more informative for the relation prediction. For example, in the refined graph, the task-noisy visual node 'man' and textual nodes 'in' and 'plans' are removed, and the newly-generated edges (e.g., 'Trump'→'US' and 'Broncos'→'football') allow more efficient information propagation. Also, the model correctly pays attention to the topic words retrieved from LAMO that are useful to infer the relation, such as 'president' and 'leader' in case #1, and 'team' and 'football' in case #2.

RQ4: Under what circumstances do the internal-information screening and external-information exploiting help?
A: In realistic scenarios, a wide range of multimodal tasks is likely to face the issues of internal-information over-utilization and external-information under-exploitation (or both simultaneously). Especially for data collected from the web, the vision and text pairs are often not well correlated. We therefore take one step further, exploring when our ideas of internal-information screening and external-information exploiting aid the task in such cases. Technically, we first measure the vision-language relevance Ψ of each image-text pair by matching the correspondence of the VSG and TSG structures. We then group the instances by their relevance scores, and make predictions for each group separately. From Fig. 10, we observe that for inputs with higher text-vision relevance, GENE plays a greater role than LAMO, while in the case of lower cross-modal feature relevance, LAMO contributes more significantly than GENE. This is reasonable because most high cross-modal relevance inputs come with rich yet often redundant information, where internal-information screening is needed for denoising. When the input text-vision sources are irrelevant, the exploitation of external features (i.e., latent topics) can be particularly useful to bridge the gap between the two modalities. On the contrary, MKGformer performs quite badly, especially when facing data with low vision-language relevance. Integrating both LAMO and GENE, our system performs consistently well in all cases.

Related Works
As one of the key subtasks of the information extraction track (Fei et al., 2020a,b, 2021a), relation extraction (RE) has attracted much research attention (Fei et al., 2021c,d; Cao et al., 2022a; Chen et al., 2022c; Guo et al., 2023). The recent trend of RE has shifted from traditional textual RE to multimodal RE, where the latter additionally adds image inputs to the former for better performance, under the intuition that visual information can offer complementary features to the purely textual input. Zheng et al. (2021b) pioneered the MRE task with a benchmark dataset collected from social media posts that come with rich vision-language sources. Later, more delicate and sophisticated methods were proposed to enhance the interactions between the input texts and images, achieving promising results (Zheng et al., 2021a; Chen et al., 2022b,a).
On the other hand, increasing attention has been paid to exploring the role of different information in the RE task. As extensively revealed in prior RE studies, only a few parts of the input sentence provide real clues for relation inference (Xu et al., 2015; Yu et al., 2020), which inspired the proposal of textual feature pruning methods (Zhang et al., 2018; Jin et al., 2022). More recently, Vempala and Preoţiuc-Pietro (2019) have shown that the visual inputs do not always serve positive contributions in existing MRE models, as social media data contains much noise. Xu et al. (2022) thus introduce an instance-level filtering approach to directly drop the images that are less informative to the task. However, such coarse-grained, aggressive data deletion inevitably abandons certain useful visual features. In this work we propose screening the noisy information from both the visual and textual input features in a fine-grained and more controllable manner, i.e., structure denoising via the graph information bottleneck technique (Wu et al., 2020). Also, we adopt scene graph structures to model both the vision and language features, which partially inherits the success of Zheng et al. (2021a) in using visual scene graphs to represent input images.
Due to the sparse and noisy characteristics of social media data (Fei et al., 2021b, 2022g, 2023), as well as the cross-modal information detachment, MRE also suffers from feature deficiency problems. We thus propose modeling the latent topic information as additional context features to enrich the inputs. Multimodal topic modeling has received considerable exploration (Chu et al., 2016; Chen et al., 2021), extending the triumph of textual latent topic models in NLP applications (Zhu et al., 2021; Fu et al., 2020; Xie et al., 2022). We note, however, that existing state-of-the-art latent multimodal models (Zosa and Pivovarova, 2022) fail to navigate the text and image into a unified feature space, which leads to irrelevant vision-text topic induction. We thus propose an effective latent multimodal model that learns coherent topics across the two modalities. To our knowledge, we are the first to integrate multimodal topic features for MRE.

Conclusion
In this paper, we address the internal-information over-utilization issue and the external-information under-exploitation issue in multimodal relation extraction. We first represent the input images and texts with visual and textual scene graph structures, and fuse them into a cross-modal graph. We then perform structure refinement under the guidance of the graph information bottleneck principle. Next, we induce latent multimodal topic features to enrich the feature contexts. Our overall system achieves substantial improvements over the existing best model on the benchmark data. Further in-depth analyses offer a deep understanding of how our method advances the task.

Limitations
The main limitations of our work lie in two aspects. First, we take extensive advantage of the scene graph (SG) structures, which are obtained with external SG parsers. Therefore, the overall performance of our system is subject to the quality of the SG parsers to some extent. However, we show that our system, equipped with the refinement mechanism, is capable of resisting quality degradation of the SG parsers to a certain extent. Second, the performance of the latent multimodal topic model largely relies on the availability of large-scale text-image pairs. However, the size of the MRE dataset is limited, which may prevent the topic model from achieving its best effect.

A Extended Method Specification
A.1 Node Embedding

In Section 3.1, we directly give the representations of nodes in the VSG and TSG. Here, we provide the encoding process in detail.
Visual Node Embedding In the VSG, the visual feature vector of an object node is extracted from its corresponding image region; the feature of an attribute node is the same as that of its connected object, while the visual feature vector of a relationship node is extracted from the union image region of the two related object nodes. Specifically, for each visual node, we first rescale its region to 224 × 224. Subsequently, following Dosovitskiy et al. (2021), each visual region is split into a sequence of fixed-size non-overlapping patches {p_k ∈ R^{P×P}}, where P × P is the patch size. Then, we map all patches of the i-th visual node to d-dimensional vectors X^{PC}_i with a trainable linear projection. For each sequence of image patches, a [CLS] token embedding x_CLS ∈ R^{d_1} is prepended to the sequence of embedded patches, and absolute position embeddings X^{POS}_i are also added to retain positional information. The visual region of the i-th node is represented as:

Z_i = [ x_CLS ; X^{PC}_i ] + X^{POS}_i ,

where [;] denotes concatenation. Then, we feed the input matrix Z_i into the CLIP vision encoder, where the output at the [CLS] token serves as the representation x̃^I_i ∈ R^{d_1} of the entire image region. Since the category label of each node provides auxiliary semantic information, a label embedding layer is built to embed the word label of each node into a feature vector. Given the one-hot vector of the category label of each node, we first map it into an embedded feature vector x̄^I_i by an embedding matrix W_label ∈ R^{d_2×C_label}, which is initialized with GloVe embeddings (i.e., d_2 = 300), where C_label is the number of categories. Then, the embedding feature of the category label is fused with the visual feature to obtain the final visual node embedding:

x^I_i = [ x̃^I_i ; x̄^I_i ] .

Textual Node Embedding In the TSG, we utilize CLIP as the underlying encoder to yield the basic contextualized word representations for each textual node.
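The ViT-style patch splitting above can be sketched as follows (ours; the linear projection, [CLS] token, and position embeddings are omitted):

```python
import numpy as np

def patchify(region, P=32):
    """Split an H x W x C image region into non-overlapping P x P
    patches, each flattened into a vector of length P*P*C."""
    H, W, C = region.shape
    patches = [
        region[r:r + P, c:c + P].reshape(-1)
        for r in range(0, H - H % P, P)
        for c in range(0, W - W % P, P)
    ]
    return np.stack(patches)
```

For a 224 × 224 × 3 region with P = 32 (matching the vit-base-patch32 backbone), this yields a sequence of 49 patch vectors.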

A.2 Graph Encoding
In Section 3.2 and Section 3.3, we introduce a graph attention model (GAT) to encode the cross-modal graph (CMG) and the refined graph (G−). Here, we provide the details. Technically, given a graph G = (V, E), where V is the set of nodes and E is the set of edges, along with the feature matrix X ∈ R^{|V|×d_1} of V with d_1 dimensions, the hidden state h_i of the i-th node is updated as follows:

h_i = σ( Σ_{j∈N(i)} α_{i,j} W_2 x_j ) ,  α_{i,j} = Softmax_{j∈N(i)}( x_i^⊤ W_3 x_j ) ,

where N(i) denotes the neighbors of the i-th node, and W_2 and W_3 are learnable parameters. In short, we denote the graph encoding as H = GAT(X, E).

Figure 11: The Venn diagram visualization of GIB (the target Y, the original graph G, the compressed graph G−, and the optimal information).
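The GAT update above can be sketched in NumPy; the bilinear scoring via W3 is our assumption about the exact attention form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(X, adj, W2, W3):
    """One simplified graph-attention update: each node aggregates its
    neighbors' projected features with attention weights."""
    n, _ = X.shape
    H = np.zeros((n, W2.shape[1]))
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i, j]]
        if not nbrs:                 # isolated node: plain projection
            H[i] = X[i] @ W2
            continue
        scores = np.array([X[i] @ W3 @ X[j] for j in nbrs])
        alpha = softmax(scores)      # normalized attention over N(i)
        H[i] = sum(a * (X[j] @ W2) for a, j in zip(alpha, nbrs))
    return H
```

With identity weights and one-hot features, each node's output is simply the uniform average of its neighbors, which makes the aggregation easy to verify by hand.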

A.3 Detailed GIB-guided Feature Refinement
Introduction to GIB Here, we provide more background on the GIB principle. Given the original graph G and the target Y, the goal of representation learning is to obtain the compressed graph G− which is maximally informative w.r.t. Y (i.e., sufficiency, I(G, Y) = I(G−, Y)) and free of noisy information (i.e., minimality, I(G−, G) − I(G−, Y) = 0), as indicated in Fig. 11. To encourage the compression process to focus on the target information, GIB enforces an upper bound I_c on the information flow from the original graph to the compressed graph, by maximizing the following objective:

max I(G−, Y)  s.t.  I(G−, G) ≤ I_c .    (24)

Eq. (24) implies that a compressed graph can improve generalization by ignoring irrelevant distractors in the original graph. By using a Lagrangian objective, GIB allows G− to be maximally expressive about Y while being maximally compressive about G:

max I(G−, Y) − β I(G−, G) ,    (25)

where β is the Lagrange multiplier. For consistency with the main body of the paper, the objective can be rewritten as:

min −I(G−, Y) + β I(G−, G) .    (26)

However, the GIB objective in Eq. (26) is notoriously hard to optimize due to the intractability of mutual information and the discrete nature of irregular graph data. By assuming that there is no information loss in the encoding process (Tian et al., 2020), the graph representation z of G− is utilized to optimize the GIB objective, leading to −I(G−, Y) ≈ −I(z, Y) and I(G−, G) ≈ I(z, G). Therefore, Eq. (26) can be computed as:

min −I(z, Y) + β I(z, G) .    (27)

Figure 12: The l-order context for (a) a node and (b) an edge.
Attention Operation for Node Filtering and Edge Adjusting In Section 3.3, we utilize the l-order context to determine whether a node should be filtered or an edge should be adjusted, since the nodes and edges in a graph have local dependence, as shown in Fig. 12. Here, we detail the calculation of the Att(·) operation in Eq. (8) and Eq. (9). In Eq. (8), the attention over the l-order context of a node v_i can be computed as:

r^v_i = Σ_{j∈Φ({v_i, ϕ(v_i)})} Softmax_j( h_i^T · h_j ) · h_j ,

where Φ({v_i, ϕ(v_i)}) is a function to retrieve the indices of the nodes in a set, and ϕ(v_i) is the l-order context of v_i. Similarly, we consider the l-order context to calculate r^e_{i,j} in Eq. (9):

r^e_{i,j} = Σ_{k∈Φ({e_{i,j}, ϕ(e_{i,j})})} Softmax_k( e_{i,j}^T · h_k ) · h_k .

Detailed GIB Optimization First, we examine the second term I(z, G) in Eq. (11). Following Sun et al. (2022), we employ variational inference to compute a variational upper bound for I(z, G) as follows:

I(z, G) ≤ E_G [ KL( p(z|G) || r(z) ) ] ,

where r(z) is the variational approximation to the prior distribution p(z) of z, which is treated as a fixed d_1-dimensional spherical Gaussian as in Alemi et al. (2017), i.e., r(z) = N(z|0, I). We use the reparameterization trick (Kingma and Welling, 2014) to sample z from the latent distribution p(z|G) = N(µ_z, σ_z), where µ_z and σ_z are the mean vector and the diagonal covariance matrix of z, computed as:

[µ_z ; σ_z] = MLP(a) ,

where a is the context feature of G− obtained from Eq. (10). z is sampled by z = µ_z + σ_z · ε, where ε ∼ N(0, I). We thus reach the following optimization to approximate I(z, G):

I(z, G) ≈ KL( p(z|G) || r(z) ) ,

where KL(·||·) is the Kullback-Leibler (KL) divergence (Hershey and Olsen, 2007). Then, we examine the first term in Eq. (11), which encourages z to be informative about Y. We expand I(z, Y) as:

I(z, Y) ≥ E_{z,Y} [ log q(Y|z) ] + H(Y) ,  (33)

where q(Y|z) is the variational approximation of the true posterior p(Y|z). Eq. (33) indicates that minimizing −I(z, Y) is achieved by minimizing the classification loss between Y and z; we model it as an MLP classifier with learnable parameters, which takes z as input and outputs the predicted label.
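Two ingredients of this optimization are easy to make concrete: the reparameterized sampling of z and the closed-form KL term against the spherical Gaussian prior r(z) = N(0, I). A minimal NumPy sketch follows (function names and shapes are our assumptions; the MLP that produces µ_z and σ_z from the context feature a is omitted):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so that gradients
    could flow through mu and sigma (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ): the quantity that
    upper-bounds I(z, G) when the prior r(z) is a spherical Gaussian."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The KL term vanishes exactly when µ_z = 0 and σ_z = I, i.e., when the compressed representation carries no information beyond the prior, which is what makes it a natural compression penalty.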

A.4 Detailed Latent Multimodal Topic Modeling
Visual BoW Feature Extraction As mentioned in Section 2.3, we represent image I with visual Bag-of-Words (VBoW) features. Here, we introduce how to extract VBoW features from an image. We compute the object-level visual words in the following four steps, as shown in Fig. 13: • Step 1: Detecting Object Proposals. We first employ a Faster-RCNN (Ren et al., 2015) as an object detector to extract all the object proposals in the training dataset. • Step 2: Featuring Object Proposals. We use a pre-trained vision-language model to obtain the feature descriptors (vectors) of each object proposal. • Step 3: Building the Codebook. The obtained feature vectors are clustered by a k-means algorithm, where the number of clusters is set to 2,000. The cluster centroids are taken as visual words. • Step 4: Representing Images. Similar to the extraction of Bag-of-Words (BoW) features for text representation, we build the Visual Bag-of-Words (VBoW) features for images. Specifically, using this codebook, each feature vector of an object proposal in an image is replaced with the id of the nearest learned visual word.

Figure 13: The four steps of VBoW feature extraction (detecting, featuring, clustering, and representing).
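Steps 3 and 4 above can be sketched with a toy k-means codebook in NumPy. This is an illustrative sketch, not the paper's pipeline: `build_codebook` and `vbow_encode` are hypothetical names, and the tiny cluster count and plain k-means loop are simplifications of the 2,000-word codebook described above.

```python
import numpy as np

def build_codebook(features, k, iters=20, seed=0):
    """Cluster proposal feature vectors with k-means; the k centroids act
    as 'visual words' (the paper uses 2,000 clusters)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest centroid
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # move each centroid to the mean of its assigned features
        for c in range(k):
            if (assign == c).any():
                centroids[c] = features[assign == c].mean(axis=0)
    return centroids

def vbow_encode(image_feats, centroids):
    """Replace each proposal feature with the id of its nearest visual word
    and count occurrences -> a Visual Bag-of-Words histogram."""
    d = np.linalg.norm(image_feats[:, None, :] - centroids[None, :, :], axis=-1)
    ids = d.argmin(axis=1)
    return np.bincount(ids, minlength=len(centroids))
```

Each image is thus reduced to a fixed-length histogram over the shared codebook, which is what makes the visual side compatible with BoW-style topic modeling.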

Detailed Latent Topic Modeling Optimization
In Section 2.3, we directly provide the final objective. In the following, we introduce how to optimize LAMO concretely. First of all, the prior parameters µ and σ of θ are estimated from the input data and defined as:

µ = f*_µ( f(H) ) ,  log σ = f*_σ( f(H) ) ,

where H is the contextualized representation obtained from the CMG, f(·) is an aggregation function, and f*(·) is a neural perceptron that linearly transforms inputs, activated by a non-linear transformation. Note that we can generate the latent topic variable θ̂ from p(θ|T, I) by sampling, i.e., θ̂ = µ + σ · ε, where ε ∼ N(0, I). Then we employ a Gaussian softmax to draw the topic distribution θ:

θ = Softmax( θ̂ ) .

Similar to previous neural topic models that handle only text (Bianchi et al., 2021), we consider autoregressively reconstructing the textual and visual BoW features of the input by the learned topic distribution θ:

p(b^T_i | ψ, θ) = Softmax( θ · ψ^T | b^T_<i ) ,
p(b^I_i | ψ, θ) = Softmax( θ · ψ^I | b^I_<i ) .
The objective function of latent multimodal topic modeling is to maximize the evidence lower bound (ELBO), which is equivalent to minimizing:

L_LAMO = KL( q(θ) || p(θ|T, I) ) − E_{p(θ|T,I)} [ Σ_i log p(b^T_i | ψ, θ) + Σ_i log p(b^I_i | ψ, θ) ] ,

where q(θ) is the prior probability of θ, set as a standard Normal prior N(0, I).
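A minimal NumPy sketch of this objective, under loudly stated simplifications: `lamo_loss` is a hypothetical name, the autoregressive conditioning on the preceding BoW entries is dropped, the Gaussian softmax is a plain softmax over the sampled θ̂, and the expectation is approximated with a single reparameterized sample.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def lamo_loss(mu, sigma, psi_t, psi_i, bow_t, bow_i, rng):
    """Single-sample sketch of the LAMO objective: sample theta_hat =
    mu + sigma * eps, draw the topic mixture theta via a softmax, score
    the textual and visual BoW reconstructions, and add the Gaussian KL
    regularizer KL( N(mu, diag(sigma^2)) || N(0, I) ).
    Shapes: mu, sigma (K,), psi_t (K, Vt), psi_i (K, Vi)."""
    eps = rng.standard_normal(mu.shape)
    theta = softmax(mu + sigma * eps)            # topic distribution
    p_t = softmax(theta @ psi_t)                 # textual word probabilities
    p_i = softmax(theta @ psi_i)                 # visual word probabilities
    # negative log-likelihood of the observed BoW counts
    rec = -(bow_t @ np.log(p_t + 1e-12)) - (bow_i @ np.log(p_i + 1e-12))
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
    return rec + kl
```

The reconstruction term pushes θ to explain both modalities' word counts through the shared topic-word matrices ψ^T and ψ^I, while the KL term keeps the posterior close to the standard Normal prior.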