Leveraging Contrastive Learning and Knowledge Distillation for Incomplete Modality Rumor Detection

Introduction
Social media platforms like Twitter and Weibo allow people to contribute vast amounts of content to the Internet. However, this content is often rife with rumors, which can lead to significant societal problems. For instance, in the first three months of 2020, nearly 6,000 people were hospitalized due to coronavirus misinformation, while COVID-19 vaccine misinformation and disinformation are estimated to cost 50 to 300 million dollars each day.
Humans are generally susceptible to false information or rumors and may inadvertently spread them (Vosoughi et al., 2018). Because manual rumor detection suffers from low coverage and long delays, automatic rumor detection models are essential. Previous text modality-based rumor detection models focused on exploring propagation (Lao et al., 2021), user information (Li et al., 2019), and writing styles (Przybyla, 2020; Xu et al., 2020).
Furthermore, as reported in Jin et al. (2017), microblogs with pictures receive 11 times more views than those without, which highlights the importance of multimodal content in rumor detection. Specifically, some studies have investigated the fusion of different modalities, such as images and text, by directly concatenating their representations (Khattar et al., 2019; Singhal et al., 2022). Nevertheless, directly concatenating two different modalities (i.e., the textual and visual modalities) may not capture their deep semantic interactions. To address this issue, co-attention-driven rumor detection models aim to extract the alignment between the two modalities (Wu et al., 2021; Zheng et al., 2022). However, current multimodal rumor detection models typically overlook the problem of incomplete modalities, such as the lack of images or text in a given post, making them less effective in handling such cases.
To address the above issues, we propose a novel approach, Contrastive Learning and Knowledge Distillation for Incomplete Modality Rumor Detection (CLKD-IMRD). Supervised contrastive learning effectively pulls together representations of the same class while pushing apart representations of different classes, and knowledge distillation can effectively handle incomplete modality cases when debunking rumors. More specifically, we first construct a teacher model that consists of multimodal feature extraction, multimodal feature fusion, and contrastive learning. Then, we adopt knowledge distillation to construct a student model that can handle incomplete modalities (i.e., the lack of images or text). Experimental results and visualizations demonstrate that our CLKD-IMRD outperforms state-of-the-art rumor detection methods on both English and Chinese social media datasets. This paper makes three major contributions.
(1) We propose a novel rumor detection framework that integrates supervised contrastive learning into the teacher network. This framework captures the deep semantic interactions among source texts, images, and user comments simultaneously.
(2) We present a knowledge distillation-driven rumor detection model that can handle incomplete modalities (i.e., the lack of images or text), a common phenomenon on social media.
(3) We conduct extensive experiments and visualizations to verify the effectiveness of the proposed CLKD-IMRD model on four benchmark rumor corpora against many strong baseline models.
Related Work

Single Modality-based Models
Generally, propagation patterns are good indicators for rumor detection, because the interactions among different kinds of users (e.g., normal users and opinion leaders), or between a source microblog and its subsequent reactions, can help to detect rumors. As described in (Wu et al., 2015), a rumor is typically first posted by a normal user, then supported by some opinion leaders, and finally reposted by a large number of normal users. The propagation of a normal message, however, is very different from the propagation pattern of fake news: a normal message is posted by opinion leaders and reposted directly by many normal users. Ma et al. (2017) also proposed a propagation path-based model to debunk rumors. Lao et al. (2021) presented a propagation-based rumor detection model that extracts the linear temporal sequence and the non-linear diffusion structure simultaneously.
In addition, user information also provides useful signals for rumor detection, because authoritative users are unlikely to publish rumors, while normal users are more likely to produce or repost them. Specifically, Mukherjee and Weikum (2015) assessed user credibility based on several factors, including community engagement metrics (e.g., number of answers, ratings given, comments, ratings received, disagreement, and number of raters), inter-user agreement, typical perspective and expertise, and interactions. Yuan et al. (2020) proposed a fake news detection model that integrates the reputations of publishers and reposting users. Li et al. (2019) presented a user credit-driven multi-task learning framework that conducts rumor detection and stance detection simultaneously.
Furthermore, writing style is a crucial factor in rumor detection, as there are often distinct differences in the vocabulary and syntax of rumors compared to non-rumors. Specifically, Rubin et al. (2015) proposed a rhetorical structure-based framework for news verification. Rhetorical structure theory is widely used in discourse analysis; it describes how the constituent units of a discourse are organized into a coherent and complete discourse according to particular relationships. According to the theory, the relationships between discourse units are mostly nucleus-satellite relationships: the discourse unit in the nucleus position is relatively more important to the author's communicative intention, and different discourse relationships hold between nucleus and satellite units. Potthast et al. (2018) adopted lexical features (e.g., character uni-grams, bi-grams, and tri-grams, stop words, part of speech, readability scores, word frequency, proportion of quoted words and external links, number of paragraphs, and average text length) for fake news detection. Przybyla (2020) presented a writing style-based fake news detection framework to verify the effectiveness of sentiment vocabularies.

Multimodal-based Models
Wang et al. (2018) presented a GAN-based multimodal fake news detection model that adopts VGG (Visual Geometry Group) to extract visual features and a CNN (convolutional neural network) to extract textual features simultaneously. They also exploited event-invariant characteristics to facilitate the detection of fake news in newly arrived events. Khattar et al. (2019) proposed a variational auto-encoder-based fake news detection model that captures the shared representation between the textual and visual modalities, with a decoder that reconstructs both modalities from the shared representation.
Despite the emergence of many multimodal rumor detection models in recent years, they often overlook the distinguishing features of samples with the same or different labels. Additionally, these models do not account for incomplete modality situations, where, for example, an image may fail to load, which is a common occurrence on the Internet. To address these gaps, we propose a novel framework based on contrastive learning and knowledge distillation to effectively debunk rumors with incomplete modalities.

Task Formulation
Let P = {p_1, p_2, ..., p_n} be a set of posts. Each post p_i consists of {t_i, v_i, c_i}, where t_i indicates a source text, v_i denotes an image, and c_i refers to a comment. We approach rumor detection as a binary classification task, with the goal of learning a function f(p_i) → y, where p_i represents the given multimodal post and y represents the label assigned to the post: y = 1 indicates a rumor and y = 0 a non-rumor.

Framework of CLKD-IMRD
Figure 1 illustrates our proposed CLKD-IMRD framework for rumor detection. Specifically, we employ a multimodal feature extraction module to obtain representations of the source text, images, and comments of a given post. For the teacher model, the extracted multimodal features are fed into a multimodal feature fusion module. During the multimodal feature fusion phase, we adopt visual features to enhance textual features, using a cross-modal joint attention mechanism to obtain enhanced features between the textual and visual representations. Then, in the output layer module, we integrate the features from the different modalities into a supervised contrastive learning framework. For the student model, which lacks the visual modality, we directly concatenate the representations of the source text and comments, and adopt knowledge distillation to obtain the corresponding classification results.

Teacher Model
The teacher model is composed of three modules: multimodal feature extraction, multimodal feature fusion, and output layer.
Multimodal Feature Extraction: For textual feature extraction, we utilize a CNN to obtain representations of the source text and comments. Given a text t_i in a post p_i, we first obtain its word embedding matrix O_i, where o_j ∈ R^d indicates the embedding of the j-th word in t_i and d denotes the dimension of the word embeddings. The matrix O_i is then fed into a CNN to obtain the feature map S_i, where k is the size of the receptive field. Next, we perform max-pooling on S_i to obtain Ŝ_i = max(S_i), and extract the final representation R_i^t for the source text by concatenating the pooled features of different receptive fields. Similarly, we extract the representation R_i^c for the comments. We utilize ResNet-50 (He et al., 2016) to extract the representation of the image v_i. Specifically, we take the output V_i^r of the second-to-last layer of ResNet-50 and feed it into a fully connected layer to obtain the final visual features R_i^v, which have the same dimension as the textual features.
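As an illustration, the TextCNN-style encoding with max-pooling over several receptive fields can be sketched as follows; the kernel sizes, filter count, and random initialization are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def cnn_text_features(O, kernel_sizes=(3, 4, 5), num_filters=32, seed=0):
    """TextCNN-style encoder: for each receptive field size k, convolve the
    word-embedding matrix O (L x d), apply ReLU, max-pool over positions,
    and concatenate the pooled features of all receptive fields."""
    rng = np.random.default_rng(seed)
    L, d = O.shape
    pooled = []
    for k in kernel_sizes:
        W = rng.standard_normal((num_filters, k * d)) * 0.1  # conv filters
        # Feature map S_i: one activation vector per window position
        S = np.stack([np.maximum(W @ O[j:j + k].ravel(), 0.0)
                      for j in range(L - k + 1)])             # (L-k+1, F)
        pooled.append(S.max(axis=0))                          # max over time
    return np.concatenate(pooled)                             # (len(sizes)*F,)

O = np.random.default_rng(1).standard_normal((50, 16))  # 50 words, d = 16
R_t = cnn_text_features(O)
print(R_t.shape)  # 3 kernel sizes x 32 filters = 96 dimensions
```

The comment representation R_i^c would be produced by the same encoder applied to the comment text; the visual branch (ResNet-50 plus a fully connected layer) is omitted here.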
Multimodal Feature Fusion: To capture the interactions among different modalities and enhance cross-modal features, we employ a co-attention (Lu et al., 2019) mechanism. For both the textual and visual modalities, we first adopt multi-head self-attention (Vaswani et al., 2017) to enhance the inner feature representations. For instance, given a textual feature R_i^t, we compute the queries, keys, and values of each head through linear transformations, where H denotes the total number of heads. We obtain the multi-head self-attention feature of the textual modality as in equation 1.
where "||" denotes the concatenation operation, h refers to the h-th head, and the concatenated heads are passed through an output linear transformation.
Similarly, we obtain the multi-head self-attention feature of the visual modality as in equation 2.
To extract the co-attention between the textual and visual modalities, we perform a similar attention process in which the queries of one modality attend over the keys and values of the other, and finally obtain the textual features Z_i^vt enhanced with visual features as in equation 3.
Next, we perform a second co-attention between Z_i^vt and Z_i^v to obtain the cross-modal features Z̃_i^vt and Z̃_i^tv, as in equations 4 and 5, respectively.
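The self- and co-attention steps above can be sketched with plain scaled dot-product attention. The single-head form, the absence of per-head projections, and the feature shapes are simplifying assumptions for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
R_t = rng.standard_normal((50, 64))   # token features of the source text
R_v = rng.standard_normal((49, 64))   # region features of the image

# Self-attention within each modality (single head for brevity)
Z_t = attention(R_t, R_t, R_t)
Z_v = attention(R_v, R_v, R_v)

# Co-attention: queries of one modality attend over keys/values of the other
Z_vt = attention(Z_t, Z_v, Z_v)   # text enhanced with visual context
Z_tv = attention(Z_v, Z_t, Z_t)   # vision enhanced with textual context
print(Z_vt.shape, Z_tv.shape)
```

Each co-attention output keeps the query modality's sequence length, which is why the text-side and vision-side enhanced features can later be pooled and concatenated with the comment features.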
Finally, we concatenate the two enhanced features with the initial comment features to obtain the final multimodal features R_i. Output Layer: Given that supervised contrastive learning (SCL) (Khosla et al., 2020) effectively pulls together representations of the same class while pushing apart representations of different classes, we incorporate the supervised contrastive loss into the total loss for rumor detection, as in equation 6. To enhance the robustness of the proposed model, we introduce projected gradient descent (PGD) (Madry et al., 2017) on the textual embeddings: at each training iteration, we calculate the gradient of the textual features and use it to compute adversarial perturbations. Then, we recalculate the gradient on the updated textual features, repeat this process m times, and project the perturbation onto a sphere to limit the range of the disturbance. Finally, we add the accumulated adversarial gradient to the original gradient and use it for parameter updates.
where C indicates the set of label types (i.e., rumor and non-rumor), y_i,c denotes the true label for class c, and ŷ_i,c refers to the predicted probability of class c.
where I denotes the set of indexes of the training samples, P_i is the set of indexes of the positive samples, A_i refers to the indexes of the contrasted samples, Z_i stands for the normalization term, and Γ indicates a temperature parameter that controls the separation of the categories.
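A minimal sketch of the supervised contrastive term, following the batch-wise formulation of Khosla et al.; the batch size, embedding dimension, and loop-based implementation are illustrative choices:

```python
import numpy as np

def supervised_contrastive_loss(Z, labels, temperature=0.5):
    """SCL: for each anchor i, the positives P_i are the other same-label
    samples, and all other samples A_i in the batch are contrasted."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # L2-normalize embeddings
    sim = Z @ Z.T / temperature
    n, total = len(labels), 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]                    # A_i
        positives = [p for p in others if labels[p] == labels[i]]   # P_i
        if not positives:
            continue
        log_denom = np.log(np.exp(sim[i, others]).sum())  # normalization Z_i
        total += -np.mean([sim[i, p] - log_denom for p in positives])
    return total / n

rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))             # a batch of 8 embeddings
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # rumor / non-rumor labels
loss = supervised_contrastive_loss(Z, labels)
print(loss)
```

The loss shrinks as same-label embeddings move closer together on the unit sphere and different-label embeddings move apart, which is the behavior the teacher model relies on.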

Student Model
As mentioned in the Introduction, existing rumor detection models ignore the incomplete modality problem. It is common for the image in a multimodal post to fail to load; due to the lack of visual information, only textual information can then be used for rumor detection. To address this issue, we adopt knowledge distillation (Hinton et al., 2015) to perform incomplete-modality rumor detection based on our pre-trained teacher model. The motivation behind knowledge distillation is to leverage the soft labels predicted by the teacher network to guide the learning of the student network and improve its performance. By minimizing the distance between the soft probability distributions of the student and teacher models, as measured by the KL loss (equation 9), we align the predictions of the student model with those of the teacher model. In essence, knowledge distillation incorporates the soft targets of the teacher network, which is complex but has superior predictive performance, into the overall loss function.
This facilitates the training of the student network, which is simpler, has lower complexity, and is more suitable for deployment in inference scenarios. The ultimate goal is to achieve effective knowledge transfer.
where q_t and q_s denote the outputs of the teacher and student networks, respectively, σ indicates the softmax function, and τ is a temperature that scales the smoothness of the two distributions. A lower value of τ sharpens the distributions, expanding the difference between them and concentrating the distillation on the maximum output predicted by the teacher network. A higher value of τ flattens the distributions, narrowing the gap between the teacher and student networks and spreading the distillation over the entire output range. The total loss of the student model is shown in equation 10.
where CE denotes the cross-entropy loss, and α is a hyper-parameter.
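A sketch of the distillation objective in equations 9 and 10; the τ² scaling follows the common convention of Hinton et al., and the exact α-weighting is an assumption about the paper's formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kd_loss(q_t, q_s, tau=5.0):
    """KL( softmax(q_t / tau) || softmax(q_s / tau) ), scaled by tau^2 so the
    gradient magnitude stays comparable across temperatures (assumed form)."""
    p_t = softmax(np.asarray(q_t, dtype=float) / tau)
    p_s = softmax(np.asarray(q_s, dtype=float) / tau)
    return tau ** 2 * float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))

def student_loss(q_t, q_s, y_true, alpha=0.7, tau=5.0):
    """Total student loss: alpha-weighted mix of the KD term and the
    hard-label cross-entropy (illustrative weighting scheme)."""
    p_s = softmax(np.asarray(q_s, dtype=float))
    ce = -float(np.log(p_s[y_true]))
    return alpha * kd_loss(q_t, q_s, tau) + (1 - alpha) * ce

# Teacher and student logits for one two-class (rumor / non-rumor) example
print(student_loss([2.0, -1.0], [1.5, -0.5], y_true=0))
```

When the student's logits match the teacher's, the KD term vanishes and only the cross-entropy against the hard label remains, which matches the alignment goal described above.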
Baselines: We compare against seven baseline models, described in Appendix A.1.
Hyper-parameter Settings: Following existing baseline systems, we divide each dataset into training, validation, and testing sets with a ratio of 7:1:2. For word embeddings, we employ Word2Vec-style embeddings as proposed in (Yuan et al., 2019). The number of heads H in self-attention is set to 8. We adopt Adam (Kingma and Ba, 2014) to optimize our loss function. The learning rate is set to 0.002, and the batch size is set to 64. The dropout rate is set to 0.6. The τ in knowledge distillation is set to 5.0. The Γ in supervised contrastive learning is set to 0.5. The lengths of the source text L_t and comments L_c are both set to 50. α is set to 0.7, and λ is set to 0.5. The number of PGD iterations m is set to 3. We perform 5 runs for all experiments and report the average results and standard deviations.

Results and Discussion
Model Comparison: Tables 2 and 3 present the average performance and standard deviation over five runs on the Chinese and English datasets, respectively. On the Pheme and Weibo-19 datasets, since we used the same training, validation, and testing splits as the baseline systems, we directly compare our results with theirs. In addition, we run the publicly available source code of these baselines on the Twitter and Weibo-17 datasets. Since EBGCN and GLAN construct a propagation graph from the comments on the source posts, we do not report their performance on the Weibo-17 and Twitter datasets, which lack comment information. It is evident from Tables 2 and 3 that our CLKD-IMRD outperforms the other models in terms of accuracy, precision, recall, and F1-score. This highlights the significance of multimodal feature fusion and contrastive learning in our approach. While ChatGPT has proven effective in various NLP tasks, its performance on rumor detection is not satisfactory. On the Weibo-17 and Twitter datasets, only short source texts are available, without comments as supplementary information, resulting in poor ChatGPT performance on the rumor detection task.
Based on the observations from Tables 2 and 3, we can derive the following insights: (1) Among the three multimodal baselines (EANN, MVAE, and SAFE), SAFE achieves the highest performance on all four measures: accuracy, precision, recall, and F1-score. MVAE demonstrates the poorest performance across all four measures on the Weibo dataset, which highlights the ineffectiveness of its superficial combination of the textual and visual modalities. In contrast, the incorporation of event information in the EANN model proves beneficial for debunking rumors. Notably, the SAFE model successfully models a deep interaction between the textual and visual modalities, resulting in superior performance.
(2) The three social graph-based baselines (EBGCN, GLAN, and MFAN) demonstrate better performance than the simpler EANN and MVAE models. EBGCN and GLAN achieve comparable performance, as both incorporate structural information. However, MFAN, which combines textual, visual, and social graph-based information, outperforms the others on all four measures: accuracy, precision, recall, and F1-score.
Performance of Knowledge Distillation: CLKD-IMRD adopts the multimodal contrastive learning model as the teacher, and multimodal and incomplete-modality models as the students. This teacher-student framework allows us to transfer knowledge from the multimodal teacher model to both multimodal and single-modality student models. We explore four student models, each trained with only the cross-entropy loss.
• Student-1: The model incorporates all modalities, including source text, visual information, and user comments.
• Student-2: The model focuses on textual (source text) and visual modalities.
• Student-3: The model relies on the textual modality, considering both the source text and comments.
• Student-4: The model exclusively relies on the source text modality.
Due to space limitations, the knowledge distillation results on only the Weibo-19 and Pheme datasets are shown in Table 4. They indicate that all student models improve when guided by the teacher model. Even student-4, which only uses the source text modality, demonstrates a 1.0%-1.7% improvement in accuracy and F1-score. Similar improvements are observed for the other three student models (student-1, student-2, and student-3). Among these, student-1, which utilizes all modalities (source text, visual information, and user comments), achieves the best performance across all four measures. Generally, student-2 outperforms student-3 due to its incorporation of both the textual and visual modalities, while student-3 relies solely on textual information.
Ablation Study: Due to space limitations, Table 5 presents the ablation analysis on only the Weibo-19 and Pheme datasets, where we examine the impact of various components by considering five cases: • w/o text: We exclude the use of the source text.
• w/o image: We omit the utilization of image information.
• w/o comment: We disregard the inclusion of comments.
• w/o contrastive learning: We eliminate the application of contrastive learning.
• w/o projection gradient descent: We do not employ projection gradient descent.
Based on the findings in Table 5, several conclusions can be drawn. 1) The source text plays a crucial role in rumor detection. The performance deteriorates significantly when the source text is excluded, underscoring the importance of the textual modality in identifying rumors. 2) Both images and comments contribute to debunking rumors, as their absence leads to a decline in performance. 3) The integration of supervised contrastive learning enhances the model's ability to distinguish between positive and negative samples in the corpora, which improves performance. 4) The inclusion of projected gradient descent in the adversarial training phase improves the robustness of our proposed model.

Impact of Co-attention Settings:
We further analyze the performance comparison with different number of co-attention as illustrated in Appendix A.2.

Impact of Number of Comments
We further analyze the performance under different comment scenarios, considering the following six cases: • 0% comments: No comments are used.
• only the first comment: Only the first comment is considered.
• 20% comments: We include 20% of the comments in a time-sequential manner.
• 50% comments: We include 50% of the comments in a time-sequential manner.
• 80% comments: We include 80% of the comments in a time-sequential manner.
• all comments: All comments are included.
Due to space limitations, the results for different numbers of comments on only the Weibo-19 and Pheme datasets are shown in Table 6. Based on these findings, we conclude that increasing the number of comments does not contribute significantly to debunking rumors. In fact, as the number of comments increases, the introduced noise becomes more prominent. Interestingly, the first comment proves to be the most valuable for rumor detection, as it carries more relevant information for distinguishing between rumors and non-rumors.
Figures 4 and 5 showcase attention visualization samples with the labels "non-rumor" and "rumor" from Weibo-19 and Pheme, respectively, providing insights into the interaction between textual and visual information and into how the enhanced features contribute to debunking rumors. In Figure 4, the words "cat" and "dog", highlighted in red, have high attention weights and align well with specific regions of the corresponding image. This accurate alignment contributes to the prediction of the sample as a non-rumor. In contrast, in Figure 5, the words "suspect" and "MartinPlace" fail to align with their respective image regions, indicating poor alignment, and the sample is correctly predicted as a rumor. These observations highlight the deep semantic interaction between the textual and visual modalities within our proposed model.

Conclusion
In this paper, we propose a rumor detection framework that combines supervised contrastive learning and knowledge distillation.Our framework leverages hierarchical co-attention to enhance the representation of textual (source text and comments) and visual modalities, enabling them to complement each other effectively.The utilization of contrastive learning has proven to be successful in debunking rumors.Additionally, knowledge distillation has demonstrated its efficacy in handling incomplete modalities for rumor detection.Moving forward, our future work aims to integrate graph structures, such as social graphs, into our proposed framework for further improvement.

A Appendix A.1 Baselines
To investigate the performance of our proposed CLKD-IMRD model, we perform comparison studies against the following approaches.
EANN (Wang et al., 2018): A GAN-based multimodal model that exploits event-invariant characteristics to facilitate the detection of fake news in newly arrived events.
MVAE (Khattar et al., 2019): A variational auto-encoder-based model that captures the shared representation between textual and visual modalities.
SAFE (Zhou et al., 2020): A similarity-aware multimodal model that debunks fake news by jointly exploiting the similarity between multimodal and cross-modal features.
EBGCN (Wei et al., 2021): An edge-enhanced Bayesian graph convolutional network-based model that investigates the reliability of potential relationships in propagation structures.
GLAN (Yuan et al., 2019): A model that debunks rumors by integrating local semantic and global structural information.
MFAN (Zheng et al., 2022): A feature-enhanced attention network-based multimodal model that combines textual, visual, and social graph features, enhancing graph topology and the neighborhood aggregation process when detecting rumors.
ChatGPT: A popular application showcasing the capabilities of the GPT language model. Since ChatGPT cannot receive the image modality, we use the source text and the first comment as its input, along with the question "judge whether it is a rumor or not", and map the responses to labels (i.e., "yes" to rumor, "no" to non-rumor, and "unable to judge" to none).

A.2 Impact of Co-attention Settings
Due to space limitations, Table 7 lists the performance comparison for different numbers of co-attention operations on only the Weibo-19 and Pheme datasets. We consider the following four cases.
• Zero Co-attention: In this case, no co-attention is used. The representations of the source text, visual images, and comments are directly concatenated.
• One Co-attention: Here, only the first co-attention is employed.
• Two Co-attention: In this case, two textvisual co-attention operations are conducted.
The enhanced textual-visual representation is then concatenated with the comment representation.
• Three Co-attention: This case involves the adoption of all three co-attention operations.
The enhanced textual and visual representations are concatenated with the comment representations.
From Table 7, we observe that performance improves as the number of co-attention operations increases. Specifically, the Zero Co-attention case demonstrates the lowest performance across all measures, which indicates the importance of capturing the deep interaction between the textual and visual modalities through co-attention. With the addition of one co-attention, performance improves, as the enhanced textual representation aids in debunking rumors. As expected, the best performance is achieved when both the enhanced textual and visual representations are utilized, as evidenced by their superior results across all four measures.

Table 2: Performance comparison of rumor detection models on the two Chinese datasets.

Table 4: Performance comparison of knowledge distillation on Weibo-19 and Pheme. Performance improvements are represented by the numbers in parentheses with the ↑ symbol.

Table 6: Performance comparison with different numbers of comments.

Table 7: Performance comparison with different numbers of co-attention operations; co-att denotes co-attention.