Interpretable Multimodal Misinformation Detection with Logic Reasoning

Multimodal misinformation on online social platforms is becoming a critical concern: compared with traditional text-only information, multimedia content lends posts greater credibility and eases their dissemination. While existing multimodal detection approaches achieve high performance, their lack of interpretability hinders reliability and practical deployment. Inspired by Neural-Symbolic AI, which combines the learning ability of neural networks with the explainability of symbolic learning, we propose a novel logic-based neural model for multimodal misinformation detection that integrates interpretable logic clauses to express the reasoning process of the target task. To make learning effective, we parameterize symbolic logical elements with neural representations, which facilitates the automatic generation and evaluation of meaningful logic clauses. Additionally, to make our framework generalizable across diverse misinformation sources, we introduce five meta-predicates that can be instantiated with different correlations. Results on three public datasets (Twitter, Weibo, and Sarcasm) demonstrate the feasibility and versatility of our model.


Introduction
Misinformation refers to incorrect or misleading information, which includes fake news, rumors, satire, etc. The enormous amount of misinformation that has emerged on online social platforms is attributed to users' reliance on information provided by the internet and their inability to discern fact from fiction (Spinney, 2017). Moreover, widespread misinformation can have negative consequences for both societies and individuals. Therefore, there is an urgent need to identify misinformation automatically. Since numerous posts on social media are multimodal (i.e., text and image), this work concentrates on multimodal misinformation detection.
Multimodal approaches, which either fuse text and image features (Wang et al., 2018; Khattar et al., 2019; Xue et al., 2021; Chen et al., 2022b) or investigate discrepancies between the two modalities (Qi et al., 2021), have been used for misinformation detection with some success. However, these methods often lack interpretability because of the black-box nature of neural networks. Some frameworks have been proposed to address this challenge. As depicted in Fig. 1, methods based on attention maps, such as those outlined in (Liang et al., 2021) and (Liu et al., 2022a), identify highly correlated text or image content (referred to here as "where") according to attention weights, while multi-view based methods, such as those described in (Zhu et al., 2022b) and (Ying et al., 2022), highlight the most contributive perspectives (referred to here as "how"). However, the explainability of the fusion of such attention or views has yet to be fully established (Liu et al., 2022b), and these methods cannot concurrently illustrate both the "where" and "how" of the reasoning process. Such interpretability is crucial for ensuring trust, reliability, and adoption of deep learning systems in real-world applications (Linardatos et al., 2021; Sun et al., 2021; Cui et al., 2022), particularly when it comes to detecting misinformation (Cui et al., 2019).
To address the aforementioned limitations, drawing on Neural-Symbolic learning (Raedt et al., 2020; Hamilton et al., 2022), we propose to incorporate logic reasoning into the misinformation detection framework to derive human-readable clauses. As shown in Fig. 1d, the clause b_1((v_1, v_2), Rumor) ∧ b_2((t_1, t_2), Rumor) ⇒ h((T, I), Rumor) is induced from the text-image pair, where the constants v_1, v_2, t_1, t_2 are visual patches and textual tokens crucial for prediction, corresponding to "where". Body predicates b_1 and b_2 indicate relationships between patches and tokens for misinformation identification, corresponding to "how". We propose to automatically learn these logic clauses, which explicitly express evident features and their interactions, to promote interpretability and improve final performance; this has not been explored by previous work. However, given the intrinsic complexity and diversity of multimodal context, it is hard to explicitly predefine the exact relationships as logic predicates. To this end, we introduce five general perspectives relevant to the task of misinformation detection as meta-predicates for clause formulation. These perspectives cover suspicious atomic textual content, visual content, relationships between text tokens, between visual patches, and between both modalities. Each meta-predicate can be instantiated with different correlations between contents of the text-image pair and target labels (e.g., (t_1, t_2) and Rumor in Fig. 1d), aiming to cover a wide range of aspects leading to misinformation. For instance, the fifth perspective involves exploiting cross-modal contents to debunk misinformation, while cross-modal ambiguity learning (Chen et al., 2022b), inconsistency between news contents and background knowledge (Abdelnabi et al., 2022), and entity misalignment are candidate correlations to achieve this goal.
Building upon these definitions, we propose a logic-based multimodal misinformation detection model (LogicDM). LogicDM first extracts embeddings for text tokens and image patches using corresponding encoders and then generates cross-modal object embeddings for different predicates using a multi-layer graph convolutional network (GCN). We then propose to parameterize meta-predicates by weighing the importance of each correlation. When combined with different object constants, these meta-predicates are softly selected to produce interpretable logic clauses defining the target predicate. The whole framework can be trained end-to-end with differentiable logic operators and probabilistic logic evaluations. To summarize, the contributions of this work include: 1) We propose an explainable neural-symbolic approach capable of automatically generating logic clauses instantiated with multimodal objects via differentiable neural components. 2) We define five meta-predicates building upon existing misinformation detection perspectives and introduce an adaptive mechanism to represent these predicates using soft selections over multiple pre-defined correlations. 3) We provide comprehensive evaluations of our model on three benchmark datasets.

Misinformation Detection
Misinformation detection has gained significant attention in recent years due to the proliferation of content on online social media (Alam et al., 2022). To identify misinformation, the text modality can be used with clues such as semantics (Zhu et al., 2022b; Ma et al., 2019), writing style (Zhou et al., 2019), emotion (Zhu et al., 2022b), special word usage (Zhu et al., 2022a), and punctuation (Pérez-Rosas et al., 2018; Rubin et al., 2016). In addition, image features can help detect misinformation, with fake and real news often exhibiting distinct image distribution patterns, including differences in image semantics and compression traces (Jin et al., 2017a,b). Intra-modal inconsistency and incongruity within the text or image (Tay et al., 2018; Huh et al., 2018) can also serve as indicators of misinformation. Cross-modal interaction and fusion, used by many recent multimodality-based methods, can further assist in detecting misinformation. For example, Qi et al. (2021) compared the characteristics of entities across the textual and visual modalities, while Ying et al. (2022) measured cross-modal inconsistency through the Kullback-Leibler divergence between unimodal distributions.

Neural-Symbolic Reasoning
Deep learning has achieved impressive results, but its limitations in interpretability and logical reasoning have been noted (Hamilton et al., 2022). To address these limitations, the integration of symbolic reasoning and neural networks, known as Neural-Symbolic AI, has gained attention as a potential solution (Raedt et al., 2020). One approach enhances neural networks with structured logic rules, such as first-order logic, that act as external constraints during model training (Hu et al., 2016; Manhaeve et al., 2018; Wang and Pan, 2021; Chen et al., 2022a). The other approach, Inductive Logic Programming (ILP), aims to automatically construct first-order logic rules from noisy data (Cropper et al., 2022). Various ILP architectures have been proposed, including NeuralLP (Yang et al., 2017), LNN (Sen et al., 2022), ∂ILP (Evans and Grefenstette, 2018), and RNNLogic (Qu et al., 2021). ILP has been applied in a range of areas including knowledge-base completion (Qu et al., 2021), question answering, and multi-hop reading comprehension (Wang and Pan, 2022). However, multimodal misinformation detection, unlike these previous applications, faces the challenge of lacking well-defined predicates and constants due to the unstructured text-image input whose two modalities differ substantially.

Task Definition
In this paper, we aim to address the problem of multimodal misinformation detection. Given a text-image pair (T, I), we seek to predict its label. To incorporate logic reasoning into the neural network, we define a candidate label set Y = {NonRumor, Rumor} for the rumor detection task and Y = {NonSarcasm, Sarcasm} for the sarcasm detection task. We also define a 2-ary predicate h that takes as input a text-image pair and a label, with the implicit meaning that the text-image pair satisfies the label. Our goal can then be reformulated as selecting a label y ∈ Y such that h((T, I), y) holds. It is worth noting that this definition allows for the extension of our framework to multi-class classification tasks by increasing the size of the label set Y.

Figure 2: The core architecture of the proposed interpretable multimodal misinformation detection framework based on logic reasoning (LogicDM). Textual nodes are fully connected to visual nodes, but we only visualize edges between one textual node and the visual nodes for ease of illustration.

Inductive Logic Programming
To address the interpretability challenge in misinformation detection, we propose a framework that induces rules or clauses of the form b_1 ∧ … ∧ b_q ⇒ h, where b_1, …, b_q are predicates in the body, h is the head predicate, and ∧ denotes conjunction. The body predicates are 2-ary, defined over an object variable O (i.e., combinations of text tokens, image patches, or both) and a label variable Y (i.e., labels in the set Y). These predicates with associated variables, such as b(O, Y), are referred to as logic atoms. By instantiating variables in body atoms with constants (e.g., b(o, y), where o is an object and y is a label), we can obtain truth values of these body atoms and subsequently derive the value of the head atom h((T, I), y) using logic operators (e.g., conjunction ∧ and disjunction ∨), where the truth value indicates the probability of the atom or clause being true and lies in the range 0 to 1, denoted as µ(·) ∈ [0, 1].
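To make these semantics concrete, the following minimal Python sketch grounds one clause and evaluates it with the product t-norm used later in Clause Evaluation; the truth values here are illustrative placeholders, whereas in our model they are produced by neural components:

# A minimal sketch of grounding and evaluating one clause under the
# product t-norm. The truth values below are illustrative placeholders;
# in LogicDM they are produced by neural components (Sec. Clause Evaluation).

def conjunction(truth_values):
    """Product t-norm: mu(a AND b) = mu(a) * mu(b)."""
    result = 1.0
    for mu in truth_values:
        result *= mu
    return result

# Grounded body atoms b_1((v_1, v_2), Rumor) and b_2((t_1, t_2), Rumor).
mu_b1 = 0.9  # e.g., a suspicious pair of visual patches
mu_b2 = 0.8  # e.g., a suspicious pair of text tokens

# The head atom h((T, I), Rumor) takes the truth value of the clause body.
mu_head = conjunction([mu_b1, mu_b2])
print(f"mu(h((T, I), Rumor)) = {mu_head:.2f}")  # 0.72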

Methodology
This section introduces the proposed logic-based multimodal misinformation detection model (LogicDM), which offers a more explicit reasoning process and better performance than existing approaches. The model consists of four main components: Feature Extraction, Cross-modal Object Generation, Clause Generation, and Clause Evaluation. Feature Extraction generates representations for text tokens and image patches using corresponding encoders. Cross-modal Object Generation constructs a cross-modal graph and applies a multi-layer graph convolutional network to generate multi-grained representations that constitute cross-modal objects as logic constants. Clause Generation produces dynamic embeddings for predicates (see Table 1) by weighing the importance of different correlations and considers the logic relationship among all predicates to adaptively derive probable logic clauses. These clauses, when instantiated with object constants, can be evaluated to determine the truth value in Clause Evaluation. The overview of this model is shown in Fig. 2 and a running example is depicted in Fig. 6.

Feature Extraction
Given the text-image pair (T, I) as input, we first tokenize T into m tokens, denoted as X_T = {w_1, w_2, …, w_m}. Then we use BERT (Devlin et al., 2019) with a one-layer LSTM (Hochreiter and Schmidhuber, 1997) as the textual encoder to obtain d-dimensional representations for all tokens in X_T, given as T = [t_1, t_2, …, t_m], where T ∈ R^{m×d}.
For the image modality, we first resize the image to 224 × 224 and divide it into r = z^2 patches, where the size of each patch is 224/z × 224/z. Similar to the text modality, these patches are reshaped to a sequence, denoted as X_I = {p_1, p_2, …, p_r}. Then we exploit a pre-trained visual backbone network (e.g., ResNet34 (He et al., 2016) or ViT (Dosovitskiy et al., 2021)) to extract visual features and map these features to d dimensions using a two-layer MLP, yielding V = [v_1, v_2, …, v_r], where V ∈ R^{r×d}.
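For concreteness, the following PyTorch sketch outlines the two encoders; the composition follows the description above, while the module names, the MLP hidden width, and the bert-base-uncased checkpoint are our assumptions:

import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import resnet34

class FeatureExtractor(nn.Module):
    """Sketch of the textual and visual encoders (Sec. Feature Extraction)."""
    def __init__(self, d=200, z=7):
        super().__init__()
        self.z = z  # the image is split into r = z^2 patches
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.lstm = nn.LSTM(768, d, num_layers=1, batch_first=True)
        backbone = resnet34(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial map
        self.visual_mlp = nn.Sequential(  # two-layer MLP mapping patches to d dims
            nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, d))

    def forward(self, input_ids, attention_mask, image):
        # Text: BERT token embeddings refined by a one-layer LSTM -> T in R^{m x d}
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        T, _ = self.lstm(h)
        # Image: a 224 x 224 input gives a z x z feature map, i.e., r = 49 patches
        fmap = self.cnn(image)               # (batch, 512, z, z)
        V = fmap.flatten(2).transpose(1, 2)  # (batch, r, 512)
        V = self.visual_mlp(V)               # (batch, r, d) -> V in R^{r x d}
        return T, V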

Cross-modal Object Generation
Cross-modal Object Generation aims to produce representations for constants (e.g., (v_1, v_2) and (t_1, t_2) in Fig. 1) to instantiate logic clauses. Different from the common definition of constants as single objects (in images or texts), we define constants according to our newly introduced meta-predicates. Specifically, we define meta-predicates as higher-level perspectives pertinent to discriminating misinformation. For this task, we use five meta-predicates, namely b_t for the single-token perspective, b_v for the single-image-patch perspective, b_{t,t} for intra-text interactions, b_{v,v} for intra-image interactions and b_{t,v} for inter-modal interactions. Detailed explanations are given in Table 1. The constants for these meta-predicates are a single token t_i, a single image patch v_i, a pair of tokens (t_i, t_j), a pair of image patches (v_i, v_j), and a pair spanning both modalities (t_i, v_j). The representations, denoted by o, for these constants are computed according to the formulas in Table 1 and will be illustrated next.
The atoms defined in Table 1 necessitate disparate uni-modal and cross-modal inputs, thus requiring our model to capture intricate intra-modal and inter-modal representations concurrently. Inspired by recent work on multimodal tasks (Liang et al., 2021; Liu et al., 2020), we propose to construct a cross-modal graph G for (T, I) to leverage the relations among text tokens X_T and image patches X_I, as well as those between the two modalities, for computing representations of cross-modal constants.
Concretely, we take the textual tokens X_T and visual patches X_I as nodes of graph G, i.e., the node matrix is the concatenation of X_T and X_I, denoted as [X_T, X_I], and the initial node embedding matrix is the concatenation of text-modality and image-modality representations, denoted as H = [T, V], where H ∈ R^{(m+r)×d}. For edges, the semantic dependencies among textual tokens are first extracted by Spacy, and if there exists a dependency between two tokens, there will be an edge between them in G. Visual patches are connected according to their geometrical adjacency in the image, following (Liu et al., 2022a). Additionally, we assume the text nodes and visual nodes are fully connected to each other to increase interactions between the two modalities, thus reducing the modality gap. Finally, the adjacency matrix A ∈ R^{(m+r)×(m+r)} can be represented as

A_{ij} = 1, if nodes i and j are tokens w_i and w_j linked by a semantic dependency;
A_{ij} = 1, if nodes i and j are patches p_{i−m} and p_{j−m}, which are determined as adjacent when they neighbor each other in the image grid;
A_{ij} = 1, if one of the two nodes is textual and the other is visual;
A_{ij} = 0, otherwise. (1)

Subsequently, an L-layer GCN (Kipf and Welling, 2017) is used to update each node embedding by fusing the information from its neighbor nodes via

H^l = ReLU(D̂^{−1/2} Â D̂^{−1/2} H^{l−1} W^l),

where Â = A + I is the adjacency matrix with self-loops, D̂ is its degree matrix, and W^l is a layer-specific learnable matrix (the standard propagation rule of Kipf and Welling (2017)). In particular, T^l ∈ R^{m×d} and V^l ∈ R^{r×d} are the updated textual and visual representations at the l-th layer, i.e., H^l = [T^l, V^l].

Table 1: The meaning of the proposed five meta-predicates and the formulas to produce cross-modal objects for each predicate. t^l ∈ R^d and v^l ∈ R^d denote textual and visual features obtained at the l-th iteration of the GCN, and the subscripts i and j denote two different features. The bold symbol o ∈ R^d represents the embedding of the corresponding constant.
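A minimal sketch of the graph construction in Eq. 1 and of one propagation step follows; the 4-neighborhood grid adjacency and the symmetric normalization are standard choices consistent with (Liu et al., 2022a; Kipf and Welling, 2017), and the helper names are ours:

import torch

def build_adjacency(dep_edges, m, r, z=7):
    """Sketch of the cross-modal adjacency matrix A in Eq. 1.
    dep_edges: (i, j) token pairs linked by a spaCy dependency."""
    n = m + r
    A = torch.zeros(n, n)
    for i, j in dep_edges:  # text-text edges: syntactic dependencies
        A[i, j] = A[j, i] = 1.0
    for a in range(r):      # image-image edges: adjacency in the z x z grid
        for b in range(r):
            ya, xa, yb, xb = a // z, a % z, b // z, b % z
            if abs(ya - yb) + abs(xa - xb) == 1:
                A[m + a, m + b] = 1.0
    A[:m, m:] = 1.0         # text-image edges: fully connected
    A[m:, :m] = 1.0
    return A

def gcn_layer(H, A, W):
    """One GCN step: H^l = ReLU(D^{-1/2} (A + I) D^{-1/2} H^{l-1} W^l)."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(-1).pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
    return torch.relu(A_norm @ H @ W)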
With T^l and V^l, we compute representations of the cross-modal objects O_t ∈ R^{m×d}, O_v ∈ R^{r×d}, O_{t,t} ∈ R^{(m×m)×d}, O_{v,v} ∈ R^{(r×r)×d} and O_{t,v} ∈ R^{(m×r)×d} as constants for the meta-predicates, according to the formulas in Table 1. In subsequent illustrations, we omit the layer index l for brevity. Intuitively, different objects have different importance for the multimodal misinformation detection task. As such, we feed the embedding of each object to a separate MLP (one linear layer with a ReLU activation) to compute its importance score corresponding to a specific meta-predicate. Then k objects are chosen for each meta-predicate based on their importance scores for clause generation and evaluation. We denote their representations as Ô_t, Ô_v, Ô_{t,t}, Ô_{v,v} and Ô_{t,v}, each of which belongs to R^{k×d}.
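The sketch below illustrates object generation and top-k selection for one meta-predicate; the concatenate-then-project form of the pair objects is our stand-in for the exact formulas of Table 1, which are not reproduced here:

import torch
import torch.nn as nn

def pair_objects(T, V, proj):
    """Build cross-modal objects O_{t,v}: one embedding per (token, patch) pair.
    The concat-then-project form is an assumption standing in for Table 1."""
    m, r, d = T.size(0), V.size(0), T.size(1)
    pairs = torch.cat([T.unsqueeze(1).expand(m, r, d),
                       V.unsqueeze(0).expand(m, r, d)], dim=-1)  # (m, r, 2d)
    return proj(pairs).reshape(m * r, d)                          # ((m x r), d)

def select_topk(O, scorer, k=5):
    """Keep the k objects with the highest importance scores, each scored by
    a one-linear-layer MLP with ReLU as described above."""
    scores = scorer(O).squeeze(-1)   # one score per object
    idx = scores.topk(k).indices
    return O[idx]                    # O_hat in R^{k x d}

d = 200
proj = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
scorer = nn.Sequential(nn.Linear(d, 1), nn.ReLU())
O_tv = pair_objects(torch.randn(12, d), torch.randn(49, d), proj)  # m=12, r=49
O_hat_tv = select_topk(O_tv, scorer, k=5)                          # (5, d)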

Clause Generation
In Clause Generation, we derive logic clauses consisting of meta-predicates that deduce the head atom h((T, I), y). For each meta-predicate, we predefine a set of g fine-grained correlations (parameterized with embeddings) between objects and labels, denoted by C ∈ R^{g×d} (i.e., C_t, C_v, C_{t,t}, C_{v,v} and C_{t,v} for the five meta-predicates). For example, C_t stores g correlations between text tokens and labels relevant to meta-predicate b_t(t, y). These correlations can be flexibly combined to form an embedding for each meta-predicate with different instantiations.
Concretely, taking meta-predicate b_t(t, y) as an example, the embedding B_t for b_t(t, y) with all instantiations t (i.e., Ô_t) is computed as

B_t = sparsemax([Ô_t, y] W^e_t C_t^⊤) C_t, (2)

where B_t ∈ R^{k×d} consists of k embeddings corresponding to the k objects extracted in Ô_t, y is the d-dimensional embedding of label y and is broadcast to k × d for concatenation, and W^e_t ∈ R^{2d×d} is a learnable matrix. In addition, we utilize sparsemax, a sparse version of softmax, to select only a small number of correlations, which has been proven effective in multi-label classification tasks (Martins and Astudillo, 2016). The intuition of Eq. 2 is to softly select correlations to form the meta-predicate embedding when the input constants are t and y. By adapting Eq. 2 to the other meta-predicates, we obtain a complete set of predicate embeddings B. Furthermore, we obtain the embedding of the entire text t_T ∈ R^d and image v_I ∈ R^d via weighted summations of all tokens and patches, respectively: t_T = T^⊤ softmax(T W_T) and v_I = V^⊤ softmax(V W_I), where W_T ∈ R^{d×1} and W_I ∈ R^{d×1} are trainable parameters to compute importance scores of tokens and patches.
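Below is a sketch of sparsemax and of Eq. 2 as reconstructed above; the bilinear scoring against the correlation set C is inferred from the stated dimensions rather than quoted from a reference implementation:

import torch

def sparsemax(z, dim=-1):
    """Sparsemax (Martins and Astudillo, 2016): a sparse alternative to softmax."""
    z_sorted, _ = torch.sort(z, descending=True, dim=dim)
    z_cumsum = z_sorted.cumsum(dim)
    ks = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype)
    support = 1 + ks * z_sorted > z_cumsum       # coordinates kept nonzero
    k = support.sum(dim=dim, keepdim=True)
    tau = (z_cumsum.gather(dim, k - 1) - 1) / k  # threshold
    return torch.clamp(z - tau, min=0)

def predicate_embedding(O_hat, y, C, W_e):
    """Sketch of Eq. 2: softly select correlations from C (g x d) to form the
    meta-predicate embedding B (k x d) for objects O_hat (k x d) and label y (d)."""
    n = O_hat.size(0)
    inp = torch.cat([O_hat, y.expand(n, -1)], dim=-1)  # (k, 2d)
    att = sparsemax(inp @ W_e @ C.t())                 # (k, g): weights over correlations
    return att @ C                                     # (k, d)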
To generate valid clauses, given the predicate embeddings B, textual representation t_T and image representation v_I, we use two sparse attention networks to select predicates pertinent to the image-text input, as well as to the given label, to form the body of a clause. Formally, we compute two attention scores S_{T,I} and S_y, indicative of the input text-image pair and the label respectively, as

S_{T,I} = sparsemax(B W_{T,I} [t_T, v_I]), S_y = sparsemax([B, t_T, v_I, y] W_y), (3)

where W_{T,I} ∈ R^{d×2d} and W_y ∈ R^{4d×1} are learnable parameters and t_T, v_I and y are broadcast for concatenation. The final score S ∈ R^{5k} is obtained via

S = S_{T,I} ⊙ S_y. (4)

Each score in S indicates the probability of its corresponding predicate being selected to deduce the head atom h((T, I), y). Then the ⌊5k × β⌋ atoms ranking at the top of S are selected to complete clause generation, where β ∈ (0, 1) is a hyperparameter controlling the clause length.
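A sketch of this atom-selection step follows, reusing the sparsemax helper from the previous sketch; the element-wise combination of the two scores in Eq. 4 is part of our reconstruction:

import math
import torch

def select_clause_atoms(B, t_T, v_I, y, W_TI, W_y, beta=0.1):
    """Sketch of Eqs. 3-4: score all 5k candidate atoms against the input and
    the label, then keep the top floor(5k * beta) as the clause body."""
    tv = torch.cat([t_T, v_I])                  # (2d,)
    S_TI = sparsemax(B @ W_TI @ tv)             # (5k,): input-conditioned score
    n = B.size(0)
    feats = torch.cat([B, t_T.expand(n, -1),    # (5k, 4d)
                       v_I.expand(n, -1), y.expand(n, -1)], dim=-1)
    S_y = sparsemax((feats @ W_y).squeeze(-1))  # (5k,): label-conditioned score
    S = S_TI * S_y                              # combined selection score
    n_atoms = max(math.floor(n * beta), 1)
    return S.topk(n_atoms).indices              # indices of the selected body atoms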

Clause Evaluation
In Clause Evaluation, we aim to derive the truth value of the head atom for each clause, given body atoms instantiated with constants. Specifically, given an atom b_t(t, y), its truth value is computed as µ(b_t(t, y)) = σ(b_t^⊤ o_t), i.e., the sigmoid of the inner product between the predicate embedding b_t (the corresponding row of B_t) and the object embedding o_t. To obtain the truth value of the head atom, we approximate the logic operators ∧ and ∨ using the product t-norm, an example of a t-norm T: [0, 1] × [0, 1] → [0, 1] (Klement et al., 2000):

µ(a ∧ b) = µ(a) · µ(b), µ(a ∨ b) = µ(a) + µ(b) − µ(a) · µ(b),

where µ(a) and µ(b) refer to truth values of atoms. With the product t-norm, the truth value of the head atom µ(h((T, I), y)) can be derived as long as the value of each body atom is given. Recall that our GCN generates representations for each layer l ∈ {0, …, L}. Therefore, with logic clauses b^l_1 ∧ … ∧ b^l_n ⇒ h((T, I), y) generated for each layer l, we use disjunctive operators to combine clauses across all the layers as

µ(h((T, I), y)) = µ((b^0_1 ∧ … ∧ b^0_n) ∨ … ∨ (b^L_1 ∧ … ∧ b^L_n)). (5)

For the target task of multimodal misinformation detection, given (T, I), we derive truth values µ(h((T, I), y)) for different candidate labels y, e.g., y ∈ {NonRumor, Rumor}. A cross-entropy loss is then adopted to train our model end-to-end, maximizing the truth values for gold labels. During inference, we compare the truth values for both labels and pick the one corresponding to the larger value as the final prediction.
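Since the evaluation reduces to a few arithmetic operations, a short sketch of Eq. 5 with illustrative atom truth values:

import torch

def t_norm_and(mus):
    """Product t-norm conjunction: mu(b_1 AND ... AND b_n) = prod_i mu(b_i)."""
    return torch.stack(mus).prod(dim=0)

def t_conorm_or(mu_a, mu_b):
    """Product t-conorm disjunction: mu(a OR b) = mu(a) + mu(b) - mu(a)mu(b)."""
    return mu_a + mu_b - mu_a * mu_b

def head_truth_value(clauses):
    """Sketch of Eq. 5: conjoin each layer's body atoms, then combine the
    per-layer clauses disjunctively across l = 0, ..., L."""
    mu_h = t_norm_and(clauses[0])
    for atom_mus in clauses[1:]:
        mu_h = t_conorm_or(mu_h, t_norm_and(atom_mus))
    return mu_h

# Illustrative truth values of the two body atoms from GCN layers l = 0, 1, 2.
clauses = [[torch.tensor(0.6), torch.tensor(0.7)],
           [torch.tensor(0.8), torch.tensor(0.5)],
           [torch.tensor(0.9), torch.tensor(0.9)]]
mu_rumor = head_truth_value(clauses)  # truth value of h((T, I), Rumor)

Computing such a value for every candidate label yields the inputs to the cross-entropy loss above.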

Experiment Setup
We verify the effectiveness of our approach on two public misinformation datasets (Twitter and Weibo) and further demonstrate its versatility on a sarcasm detection dataset (Sarcasm).

Overall Performance

Table 2 and Table 3 present comparison results for the multimodal misinformation detection and sarcasm detection tasks against popular baselines. Despite well-recognized trade-offs between performance and model interpretability (Raedt et al., 2020), both tables indicate that our proposed LogicDM consistently surpasses existing state-of-the-art methods in terms of both Accuracy and F1 Score. In particular, our model brings 3.9% and 1.2% accuracy improvements over the state-of-the-art BMR on Twitter and CAFE on Weibo, respectively. Moreover, our model demonstrates superior Precision to the other baselines on Sarcasm. Such results verify the advantage of integrating logical reasoning with neural networks. We conjecture that the logic components may encourage our model to learn useful rules instead of overfitting to noise. It is also worth mentioning that there is a performance difference between Rumor and Non-Rumor on Twitter, which may be due to unbalanced class proportions within the training set.

Furthermore, it is observed that multi-modality based methods generally outperform uni-modality based methods, suggesting that text and image can provide complementary information to enhance detection performance. In addition, CAFE and BMR can estimate the importance of different modalities to adaptively aggregate unimodal representations, via an ambiguity measure component and multi-view learning respectively, thus showing better performance than simple fusion or concatenation. In contrast, our model achieves this goal by softly choosing predicates to induce logic clauses while taking into consideration the logic relationships among these predicates.

Interpretation Study
To illustrate the interpretability of our proposed framework LogicDM, we visualize the learned rules in Fig. 3. Despite the complicated text-image input, it is evident that our model can explicitly locate highly correlated content as constants for "where" and softly choose suitable meta-predicates for "how". For example, as shown in Fig. 3c, the objects "a city" and "my baby" are selected to instantiate b_1 (i.e., b_{t,t}) and b_2 (i.e., b_t), where both predicates indicate that samples with indefinite pronouns are more likely to be rumors. By comparison, samples with proper nouns can usually be detected as non-rumors because of their more realistic descriptions, as seen in Fig. 3d. Moreover, the derived explanations can provide supplementary insights and knowledge previously unknown to practitioners. For example, as seen in Fig. 3a, the logic reasoning based on two visual patches, b_1 and b_2 (i.e., both b_v), implies that these areas are hand-crafted (i.e., produced by Photoshop), which is difficult for humans to discern. Furthermore, our model can mitigate the trust problem of AI systems through further analysis of the derived clauses. For instance, although the non-rumor in Fig. 3b is identified accurately, the decision may not be sufficiently convincing given only "tower", "landmark" and the relevant predicates b_1 and b_2 (i.e., both belonging to b_{t,t}); in other words, the result may not be reliable in this case. The interpretability of the model allows for further understanding of the decision-making process, thus increasing the reliability and trustworthiness of the system.

Ablation Study
In the ablation study, we conduct experiments to analyze the impact of different parameters on performance, including the number of correlations g and the rate β in Sec. 4.3 as well as the selected iterations l in Sec. 4.4. For illustration, we report the Precision, Recall, and F1 Score of rumor and the Accuracy on the Twitter and Weibo datasets.

Impact of Number of Correlations. In order to effectively deal with diverse online misinformation, we propose to adaptively represent predicates through their corresponding correlation sets in Clause Generation. As seen in Fig. 4, the influence of varying the number of correlations (i.e., g) on performance reveals that the results increase markedly as g grows and then gradually decrease after reaching a peak (e.g., 10 for the Twitter dataset and 15 for the Weibo dataset). These results validate the effectiveness of the dynamic predicate embedding mechanism and suggest that the optimal number of correlations depends on the complexity of the specific scenario. However, it should be noted that our model tolerates an excessive number of correlations without significant performance degradation.

Impact of Logic Clause Length. In Clause Generation, we deduce logic clauses of a fixed length by adjusting the rate β. As illustrated in Fig. 5, the performance drops significantly as β increases beyond 0.15. This observation can be attributed to two possible reasons: 1) the product t-norm may result in exponential decay as the number of atoms in the clause grows, leading to decreased stability, as previously reported in the literature (Wang and Pan, 2022); 2) including redundant logic atoms may inevitably introduce noise and negatively impact performance. These findings suggest that a moderate β is optimal for clause generation.

Impact of Selected Iterations. In Clause Evaluation, we obtain the final truth value of the head atom h((T, I), y) by selectively aggregating clauses produced at different iterations of the GCN with the disjunction operator ∨. Table 4 compares various ways of computing µ(h((T, I), y)), revealing that our model achieves the best performance when l = 2 and the worst performance when l = 0. Such results highlight the importance of capturing intra-modal and inter-modal interactions of the multimodal input through the multi-layer GCN for our task. Furthermore, it is observed that disjunctive combinations of clauses perform more robustly than non-disjunctive combinations on Weibo, potentially due to the logic-based fusion of information at different iterations. These results provide insights into the importance of incorporating multiple iterations into clauses for better performance in some cases.

Conclusion
We propose LogicDM, an interpretable multimodal misinformation detection model based on neural-symbolic AI. We predefine five meta-predicates and relevant variables derived from corresponding misinformation detection perspectives, and we propose to dynamically represent these predicates by fusing multiple correlations to cover diversified online information. Moreover, we make the reasoning process differentiable to smoothly select predicates and cross-modal objects, deriving and evaluating explainable logic clauses automatically. Extensive experiments on the misinformation detection task demonstrate the effectiveness of our approach, and additional experiments on the sarcasm detection task reveal its versatility.

Limitations
Our work has two limitations that may impact the generalization ability of our proposed framework. Firstly, in Clause Generation (Sec. 4.3), we deduce logic clauses involving a fixed number of atoms, namely ⌊5k × β⌋, rather than a variable length for each iteration of the GCN. While this approach has demonstrated superior performance on the multimodal misinformation detection and sarcasm detection tasks, it may harm the generalization of our framework to more complex multimodal misinformation tasks, such as the detection of fake news involving various modalities, including social networks, text, user responses, images and videos, as discussed in (Zhou and Zafarani, 2021; Alam et al., 2022). Secondly, our incorporation of logic into the neural network relies on the product t-norm to make the logic operators (i.e., ∧ and ∨) differentiable. However, as shown in the Ablation Study (Sec. 5.4), the product t-norm may lead to vanishing gradients as the number of logic atoms increases during training, which may limit the ability of our framework to handle more sophisticated scenarios. We plan to address these limitations in future research.

Ethics Statement
This paper complies with the ACM Code of Ethics and Professional Conduct. Firstly, our adopted datasets do not contain sensitive private information and will not harm society. Secondly, we cite relevant papers and the sources of pre-trained models and toolkits exploited by this work in as much detail as possible. Moreover, our code will be released in accordance with the licenses of any used artifacts. At last, our proposed multimodal misinformation detection approach will contribute to protecting human beings from a detrimental and disordered online environment with more trustworthy interpretations.

A Implementation

We adopt pre-trained BERT (bert-base-uncased for Twitter and Sarcasm and bert-base-chinese 7 for Weibo) with a one-layer LSTM as the textual encoder to extract 200-dimensional textual features. For the visual modality, we divide the 224 × 224 image into 32 × 32 patches (i.e., r = 49, z = 7). We utilize ResNet34 8 as the visual backbone for Twitter and Weibo, following (Chen et al., 2022b), and ViT 9 for Sarcasm, following (Liu et al., 2022a). The extracted visual features are subsequently mapped to the same dimension as the textual features. In Cross-modal Object Generation, we apply a two-layer GCN (i.e., L = 2) to generate high-level representations of textual tokens and visual patches and set k = 5 to filter out five candidate objects for each meta-predicate. In Clause Generation, we set the number of correlations g = 10 and β = 0.1 to derive explainable logic clauses of length ⌊5k × β⌋. At last, we set h((T, I), y) = b^2_0 ∧ … ∧ b^2_{⌊5k×β⌋−1} (i.e., l ∈ {2}) to obtain the truth value of the target atom in Clause Evaluation. The number of parameters of our model is 4,601,019, without taking the parameters of BERT and the visual backbone network (i.e., ResNet and ViT) into account.

During model training, we set the batch size to 32 and the number of epochs to 20, and we use the Adam optimizer for all three datasets. We adopt an initial learning rate of 0.0001 and a weight decay of 0.0005 for Twitter and Weibo, and 0.00002 and 0.0005 for Sarcasm. Moreover, an early stopping strategy is used to avoid overfitting. We run our experiments on four NVIDIA 3090Ti GPUs.
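For reference, a minimal sketch of the reported optimization setup for Twitter and Weibo (the stand-in model is for illustration only; Sarcasm uses a learning rate of 0.00002 with the same weight decay):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for LogicDM, for illustration only

# Adam with lr 0.0001 and weight decay 0.0005 (Twitter/Weibo); batch size 32,
# 20 epochs, with early stopping to avoid overfitting.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()  # applied to truth values of candidate labels
BATCH_SIZE, NUM_EPOCHS = 32, 20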
For model evaluation, in accordance with prior research (Chen et al., 2022b), we report Accuracy as well as Precision, Recall, and F1 Score for the rumor and non-rumor classes on Twitter and Weibo, and Accuracy as well as Precision, Recall, and F1 Score for sarcastic posts on Sarcasm.

B Baseline Models
To comprehensively evaluate our proposed method LogicDM, we divide the baseline models into two categories: Uni-Modal and Multi-Modal methods. For Uni-Modal baselines, we adopt Bert (Devlin et al., 2019), where the mean embedding of all tokens is utilized for classification, and pre-trained visual backbone networks, where the feature representation after the final pooling layer is used. Specifically, for the visual backbone model, we adopt ResNet (He et al., 2016) for the Twitter and Weibo datasets as suggested by Chen et al. (2022b), and ViT (Dosovitskiy et al., 2021) for the sarcasm detection dataset, following Liu et al. (2022a).

7 https://huggingface.co/bert-base-chinese
8 https://pytorch.org/vision/main/models/generated/torchvision.models.resnet34
9 https://github.com/lukemelas/PyTorch-Pretrained-ViT
For Multi-Modal baselines, we utilize different approaches for multimodal misinformation detection and sarcasm detection due to the discrepancy between the two tasks. Concretely, for Twitter and Weibo, we adopt Vanilla, EANN (Wang et al., 2018), MVAE (Khattar et al., 2019), SAFE (Zhou et al., 2020), MVNN (Xue et al., 2021), CAFE (Chen et al., 2022b), and BMR (Ying et al., 2022). In particular, Vanilla fuses the textual and visual features extracted by the corresponding encoders of our proposed LogicDM for classification, and we re-implement BMR using the same Feature Extraction component as our method and removing the image pattern branch for a fair comparison. For Sarcasm, we utilize HFM (Cai et al., 2019), D&R Net (Xu et al., 2020), Att-BERT (Pan et al., 2020), InCrossMGs (Liang et al., 2021) and HCM (Liu et al., 2022a).

Figure 6: A running example of our proposed LogicDM. In this example, we set ⌊5k × β⌋ = 2, implying that the derived clause at each iteration consists of two logic atoms, and the number of GCN layers is L = 2.

C Running Example
To facilitate understanding of the overall reasoning process, we provide an additional running example, depicted in Fig. 6. The reasoning process can be summarized as follows: 1) Given the text-image pair as input, our model first extracts textual features T and visual features V using the corresponding encoders. 2) These features are used to construct a cross-modal graph, described by the adjacency matrix A in Eq. 1 and the node matrix H = [T, V]. This graph is fed into an L-layer GCN to conduct cross-modal reasoning. At iteration l of the GCN, the output H^l is used to construct the cross-modal objects O^l_t, O^l_v, O^l_{t,t}, O^l_{v,v} and O^l_{t,v}, corresponding to each predicate, using the formulas in Table 1. These objects are then refined to retain only the most salient ones, denoted as Ô^l_t, Ô^l_v, Ô^l_{t,t}, Ô^l_{v,v} and Ô^l_{t,v}, which serve as constants to instantiate logic clauses. 3) To derive logic clauses at iteration l, we obtain the predicate representations by weighting the importance of each correlation in the corresponding correlation set C for each pair of objects and label y using Eq. 2. Then two atoms from B^l are selected to form the clause body b^l_0 ∧ b^l_1 based on the importance score S^l in Eq. 4. 4) As each iteration produces one logic clause, the final clause is deduced by combining the clauses of all iterations disjunctively, i.e., ⋁_l (b^l_0 ∧ b^l_1) ⇒ h((T, I), y), whose truth value can be computed based on Eq. 5 and the product t-norm. In this example, we only choose b^2_0(v_1, Rumor) ∧ b^2_1(v_2, Rumor) as the final clause.

ACL 2023 Responsible NLP Checklist
A For every submission:

A1. Did you describe the limitations of your work?
In Section Limitations.
A2. Did you discuss any potential risks of your work?
Our work focuses on the multimodal misinformation task and will not harm society. We illustrate it in Section Ethics Statement in detail.
A3. Do the abstract and introduction summarize the paper's main claims?
In Section Abstract and Sec.1 Introduction.
A4. Have you used AI writing assistants when working on this paper?
Left blank.

B Did you use or create scientific artifacts?
Yes, we specify the pre-trained models (Bert, ViT, ResNet) and the toolkit (Spacy) we used in Sec. A Implementation. Moreover, we use the original Twitter, Weibo, and Sarcasm datasets, and the preprocessing follows existing work. For our code, we obey MIT License.

B1. Did you cite the creators of artifacts you used?
Yes, in Sec. 5.1 Experiment Setup and Sec. A Implementation in Appendix.
B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
Spacy, our proposed framework LogicDM, ViT: MIT License. Bert, ResNet, Twitter dataset: Apache License 2.0. For the other two public datasets, we cannot find related licenses.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? In Sec. Ethics Statement.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? Not applicable. Related papers on the three datasets (i.e., Weibo, Sarcasm, Twitter) have been published at MM, ACL, and the International Journal of Multimedia Information Retrieval, and these datasets are popular benchmarks for misinformation and sarcasm detection tasks. We believe there is no offensive or private information in these datasets.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Yes, in Sec. A Implementation in Appendix and more detail can be found in the URLs provided in the footnotes.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. In Sec. 5.1 Experiment Setup.

C Did you run computational experiments?
In Sec. 5 Experiment.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? In Sec. A Implementation in the Appendix.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? In Sec. A Implementation in the Appendix. In addition, for hyperparameter search, we search optimal learning rate from [0.001, 0.0001, 0.0005, 0.00002] and weight decay from [0.0005, 0.0001, 0.001]. However, we fix hyperparameters during model training for reported results.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? We report the average results of three runs to avoid randomness.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? In Sec. A Implementation in the Appendix.
D Did you use human annotators (e.g., crowdworkers) or research with human participants?
Left blank.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? No response.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)? No response.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? No response.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board? No response.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? No response.