Beyond Detection: A Defend-and-Summarize Strategy for Robust and Interpretable Rumor Analysis on Social Media



Introduction
Due to the low cost and easy access to information, social media has become a popular platform for information dissemination. However, it also facilitates the spread of misinformation (Vosoughi et al., 2018). The spread of rumors can cause panic, damage public mental health, or lead to severe economic loss (Verma et al., 2022). Therefore, debunking unverified rumors on the Internet has become an indispensable issue (Ahsan et al., 2019). Numerous researchers have been dedicated to detecting rumors automatically. Early works mostly rely on the textual content of each post and the corresponding responses (Ma et al., 2016; Volkova et al., 2017). In addition, several studies show the importance of considering the propagation path between the responses within the same conversation thread (Ma et al., 2017, 2018; Lu and Li, 2020). To better extract information from the propagation, Graph Convolutional Networks (GCNs) are widely adopted and achieve remarkable performance on the rumor detection problem (Bian et al., 2020; Wei et al., 2021; Sun et al., 2022). For instance, Song et al. (2021) pioneer the integration of a transformer and a GCN to better detect rumors.
However, two main challenges remain unaddressed. First, detectors can be sensitive to critical responses toward an event, i.e., responses that significantly impact the detectors. Fig. 1 demonstrates that roughly 18.9% of posts in the Twitter15 dataset contain critical responses. The influence of such responses may be exploited as an attack vector by adversaries. As prior studies mainly focus on determining the veracity of a given claim from the source post and responsive posts, the potential threat from attack responses can leave detection models vulnerable (Le et al., 2020). Hence, some works have developed GAN-style frameworks to build more robust detectors (Ma et al., 2019, 2021; Song et al., 2021). However, retraining the entire model to defend against attacks is time-consuming and limited to recognizing only adversarial examples, disregarding the varied forms of real-world malicious attacks.
On the other hand, recent works mainly leverage neural networks for prediction, making the rationale behind those predictions unattainable due to the black-box nature of such models (Ghorbani et al., 2019). To better interpret detector behavior, some works utilize attention mechanisms to highlight important parts of the inputs (Khoo et al., 2020; Lu and Li, 2020), which demonstrates the feasibility of probing detection models by identifying influential responses in a conversation thread. However, such an approach lacks comprehensive and human-understandable clues, which brings the second challenge: providing organized explanations that cover different viewpoints. We posit that considering multiple perspectives within a discussion thread enhances readers' awareness of diverse viewpoints, discouraging the uncritical acceptance of an overly confident verdict.
In this paper, we propose a novel framework called Defend-And-Summarize (DAS) to reduce detector vulnerability and provide prediction explanations. The design of DAS follows the idea that responses with similar stances or viewpoints should lie closer in the embedding space. This concept is substantiated by prior studies (Darwish, 2019; Rashed et al., 2021), which show that various political standpoints on Twitter can be well partitioned into distinct clusters based on embedding representations. This characteristic enables summarization with more structured and comprehensive information. As such, DAS includes a response extractor and a response abstractor. The extractor filters and organizes the responses, while the abstractor condenses the information from the organized responses. To improve robustness, we preemptively mitigate malicious attacks with the response extractor. Exploiting the idea of anomaly detection, we filter the responses by treating genuine ones as normal data and attack ones as anomalies. In addition to removing the potentially risky responses, we further organize the remaining responses to find representative ones. We apply clustering to automatically explore the underlying aspects of the data so that model predictions can be interpreted from different perspectives. Representative responses are then extracted from the medoid of each cluster. Afterward, the response abstractor produces more comprehensive and human-understandable explanations by summarizing the responses from each cluster. We exploit pre-trained abstractive summarizers and transfer them to the rumor detection corpora via self-supervised learning. In particular, the abstractor is finetuned on cluster-summary pairs where the medoid of each response cluster serves as a pseudo summary. Combining the extractive and abstractive summaries from DAS provides detectors and users with more reliable and comprehensive information on different viewpoints. Moreover, we introduce a Bi-directional Transformer-Graph Network (BiTGN) to improve rumor detection by integrating the robust textual representations of the transformer and the structural information of the Bidirectional GCN (BiGCN). The contributions of this paper are summarized as follows:
• We propose a novel framework named DAS that reduces model vulnerability and provides prediction explanations without additional annotations or retraining of the detection models.
• We explain model predictions with extractive and abstractive summaries by incorporating the concept of clustering into self-supervised learning.
• Experiments on three public datasets show that DAS defends against attacks while producing multi-perspective explanations, and the proposed BiTGN achieves state-of-the-art rumor detection. Human evaluation further demonstrates the interpretability of the generated summaries.

Related Work
Model Vulnerability Adversarial attacks have been used to simulate the impact of critical responses (Xu et al., 2021; Mehrabi et al., 2022; Xie et al., 2022). For example, Ma et al. (2019, 2021) adopt GAN-style training in which a generator produces adversarial responses to make the detector more robust. However, this approach requires retraining the entire model. In contrast, our work presents a novel framework that resists response attacks without retraining the model.
Interpretability One class of studies typically explains model predictions by analyzing the attention given to different parts of the inputs, usually by visualizing word importance scores (Ribeiro et al., 2016; Vig, 2019) or using heatmaps (Samek et al., 2017). For instance, Lu and Li (2020) visualize the attention weights between source tweets and the propagation structures to highlight evidential words and suspicious users when predicting fake news. Similarly, Khoo et al. (2020) provide token-level and post-level explanations by examining the attention weights of transformer layers. Apart from attention-based approaches, Pugoy and Kao (2021) explain recommender system predictions by producing extractive summaries from user and item reviews, which capture crucial sentences for the models and provide more comprehensive information than word-level and review-level explanations. Consequently, in this paper, we attempt to provide realistic explanations for rumor detection models by summarizing the different opinions in each conversation thread.
Rumor Detection Early studies tend to verify the truthfulness of social media posts based on either traditional language processing techniques (Badaskar et al., 2008; Potthast et al., 2018) or hand-crafted features (Yang et al., 2012; Liu et al., 2015; Ma et al., 2015; Wu et al., 2015). In recent years, deep neural networks such as CNNs and RNNs have been widely adopted to extract text features automatically (Volkova et al., 2017; Ma et al., 2016). Furthermore, Recursive Neural Networks (RvNN) (Ma et al., 2018) and GCN-based approaches (Bian et al., 2020; Wei et al., 2021; Sun et al., 2022) have been proposed to analyze the propagation structures of rumors. In addition, a recent line of studies leverages Transformers (Khoo et al., 2020; Song et al., 2021; Tian et al., 2022) to capture long-distance interactions between responsive posts.

Problem Formulation
Here, we first define the notations. A conversation thread is denoted by $X = \{x_i\}_{i=0}^{n}$, where $x_0$ is the source post containing the main event to be verified, and $\{x_i\}_{i=1}^{n}$ represents the responsive posts of $x_0$. A graph $G = \langle V, E \rangle$ with vertex set $V$ and edge set $E$ is formed by taking each post in $X$ as a node, and the responsive relations between nodes define the edges. Specifically, two nodes $x_i$ and $x_j$ are connected by an edge $e_{ij} \in E$ if one of them responds to the other. The ground-truth label is denoted by $y \in Y = \{N, F, T, U\}$ (i.e., Non-Rumor, False Rumor, True Rumor, and Unverified Rumor). In our framework, the rumor detector aims to predict the veracity of the source post $x_0$ with or without attacks, while the response summarizer aims to extract $k$ representative responses and produce abstractive summaries accordingly.
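As a concrete illustration of the formulation above, the sketch below builds the edge set $E$ and the directed adjacency matrix $A$ (used later by the detector, with $A_{ij} = 1$ if $x_j$ responds to $x_i$) from the reply relations of a small thread. The `parent` mapping and the function name are illustrative, not part of the paper.

```python
def build_graph(parent, n):
    """Build the undirected edge set E and directed adjacency A for a thread
    of n+1 posts, where parent[i] = j means post x_i responds to post x_j."""
    edges = set()
    A = [[0] * (n + 1) for _ in range(n + 1)]
    for i, j in parent.items():
        edges.add((min(i, j), max(i, j)))  # e_ij: connected if one responds to the other
        A[j][i] = 1                        # A_ji = 1 because x_i responds to x_j
    return edges, A

# x_1 and x_2 reply to the source x_0; x_3 replies to x_1.
E, A = build_graph({1: 0, 2: 0, 3: 1}, n=3)
```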

Rumor Detector
We first introduce the proposed Bi-directional Transformer Graph Network (BiTGN) for rumor detection, which integrates the advantages of the transformer network and the Bi-directional Graph Convolutional Networks (BiGCN), as depicted in Fig. 2. Previous studies have shown that Transformer-based models are more robust to out-of-distribution data (Hendrycks et al., 2020) and adversarial attacks (Jin et al., 2020) than conventional models such as CNNs and RNNs. To obtain robust textual features, we adopt a transformer encoder $\theta_{enc}$ with $L_e$ layers to encode all posts in a conversation thread by concatenating them into a single sequence. Specifically, the post content is first transformed into vector representations by an embedding layer. Let $h_i$ denote the embedding of post $x_i$; the embedding of a conversation thread is

$$H^{(0)} = h_0 \,\|\, h_1 \,\|\, \cdots \,\|\, h_n,$$

where $\|$ stands for concatenation. Next, the embeddings are iteratively fed into each encoder layer, which consists of Multi-Head Attention (MHA) (Vaswani et al., 2017). The hidden representation at the $l$-th transformer layer is denoted by $H^{(l)} = \theta_{enc}^{(l)}(H^{(l-1)})$. Note that since we concatenate all posts in a thread, different posts can attend to each other and exchange information during the encoding process. After text encoding, we obtain the node feature $z_i$ by mean-pooling all the token representations of the $i$-th post from the last encoder layer; the hidden feature matrix is

$$Z = [z_0; z_1; \ldots; z_n] \in \mathbb{R}^{(n+1) \times d}.$$

To further aggregate the contextual features with the structure of responses, we leverage a GCN-based model $\theta_{gcn}$ to capture the interactions between different posts in two directions, consisting of a Top-Down GCN (TD-GCN) and a Bottom-Up GCN (BU-GCN). Let $A \in \mathbb{R}^{n \times n}$ denote the adjacency matrix where $A_{ij} = 1$ if $x_j$ responds to $x_i$. The adjacency matrices for TD-GCN and BU-GCN are $A^{TD} = A$ and $A^{BU} = A^{T}$, respectively. The feature matrix is iteratively updated by each GCN layer in both directions. As such, the aggregated feature $Z^{TD}$ from the TD-GCN with $L_g$ layers
is obtained as

$$Z^{TD}_{l} = \sigma\big(\hat{D}^{-1}\hat{A}^{TD}\, Z^{TD}_{l-1}\, W^{TD}_{l-1}\big),$$

where $\hat{D}_{ii}$ represents the degree of the $i$-th node and $W^{TD}_{l-1} \in \mathbb{R}^{d \times d}$ is a learnable matrix. Similarly, the aggregated result for BU-GCN is obtained by substituting $A^{TD}$ with $A^{BU}$ in Eq. (3). In the final step, the aggregated features $Z^{TD}$ and $Z^{BU}$ are concatenated and passed through a fully connected layer and a softmax function:

$$\hat{y} = \mathrm{softmax}\big([Z^{TD} \,\|\, Z^{BU}]\, W + b\big),$$

where $W \in \mathbb{R}^{2d \times |Y|}$ and $b \in \mathbb{R}^{|Y|}$ are trainable parameters and $\hat{y}$ is a vector indicating the predicted probability of each class.
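A minimal numpy sketch of the graph aggregation and classification head described above. The symmetric normalisation with self-loops and the mean-pooling over nodes before the fully connected layer are standard GCN choices assumed here for a runnable example; the paper itself only states that $Z^{TD}$ and $Z^{BU}$ are concatenated and passed through a fully connected layer and softmax.

```python
import numpy as np

def gcn_layer(A, Z, W):
    """One GCN layer: add self-loops, symmetrically normalise A, then ReLU."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ Z @ W, 0.0)

def bitgn_head(A_td, Z, W_td, W_bu, W_out, b):
    """Aggregate node features in both directions, pool, and classify."""
    Z_td = gcn_layer(A_td, Z, W_td)
    Z_bu = gcn_layer(A_td.T, Z, W_bu)   # bottom-up uses the transposed adjacency
    pooled = np.concatenate([Z_td, Z_bu], axis=1).mean(axis=0)  # assumed pooling
    logits = pooled @ W_out + b
    e = np.exp(logits - logits.max())
    return e / e.sum()                  # softmax over the |Y| classes
```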

Adversarial Response Generator
To simulate attack responses from various users in real-world scenarios, we adopt the Adversarial Response Generator (ARG) proposed by Song et al. (2021). Given a conversation thread $\{x_i\}_{i=0}^{n-1}$, ARG generates an adversarial response $x^*_n$ that makes the detector deviate from the ground truth $y$ by maximizing the detection loss $\mathcal{L}_{det}$, detailed in Sec. 3.5. Notably, ARG shares the encoder $\theta_{enc}$ with BiTGN and takes the hidden representation of the last encoder layer $H^{(L_e)}$ as input. Moreover, the generated response $x^*_n$ is then attached to the source post $x_0$ to update the adjacency matrix, and its representation $h^*_n$ is concatenated with the embeddings of $\{x_i\}_{i=0}^{n-1}$ to serve as part of the encoder's inputs.

Defend-And-Summarize Framework

Defensive Response Extractor (DRE) To defend against the attacks, we aim to filter out the attack responses simulated by ARG. We hypothesize that if a malicious response can mislead the detector, it must deviate from the other, normal responses in the embedding space. Therefore, we adopt an autoencoder (AE) to detect anomalies according to the reconstruction error. Concretely, we initialize the encoder $\phi_{ext\text{-}e}$ and decoder $\phi_{ext\text{-}d}$ of the AE with transformer layers and train the model on normal responses. The reconstruction process of a response $x_i$ is

$$z = \phi_{ext\text{-}f_1}\big(\phi_{ext\text{-}e}(h_i)\big), \qquad \hat{h}_i = \phi_{ext\text{-}d}\big(\phi_{ext\text{-}f_2}(z)\big),$$

where $\phi_{ext\text{-}f_1}$ and $\phi_{ext\text{-}f_2}$ represent fully connected layers, and $z \in \mathbb{R}^{|x_i| \times d_z}$ is the compressed hidden representation with dimension $d_z \ll d$. We apply the $L_2$ loss to calculate the reconstruction error and select the top-$m$ responses with the least loss, since an unseen attack response should cause a more significant error. The selection number $m$ is determined by a pre-defined extract ratio $\rho$: $m = \lceil \rho \times n \rceil$. After the filtering process, we take the mean-pooling of the remaining responses $\{h_i\}_{i=1}^{m}$ to obtain the response representations $\{r_i\}_{i=1}^{m}$. Afterward, k-means clustering is performed on these representations to capture the intrinsic perspectives of
different responses. The responses are partitioned into $k$ clusters by minimizing the intra-cluster sum of distances from each sample to its nearest centroid. Let $\{C_j\}_{j=1}^{k}$ denote the set of $k$ clusters. The extractive summary is formed by combining the medoids of all clusters, where a medoid $r^{ext}_j$ is the response closest to the cluster's centroid $c_j$:

$$r^{ext}_j = \arg\min_{r_i \in C_j} \big\| r_i - c_j \big\|_2.$$

Finally, the embedding of the extraction result is denoted by

$$H^{ext} = h^{ext}_1 \,\|\, \cdots \,\|\, h^{ext}_k,$$

where $h^{ext}_j$ is the response embedding of the medoid $r^{ext}_j$. Note that some responses may lose their parent node after the extraction. Thus, we assign a new parent node to such a response by recursively tracking back until a remaining node is found.
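The filter-then-cluster procedure above can be sketched as follows. Here `H_rec` stands in for the autoencoder's reconstruction of the response embeddings `H`, and the naive Lloyd iterations are an illustrative stand-in for the k-means step; all names are ours, not the paper's.

```python
import numpy as np

def filter_and_extract(H, H_rec, rho, k, iters=10, seed=0):
    """Keep the m = ceil(rho * n) responses with the smallest L2 reconstruction
    error (attack responses should reconstruct poorly), then run a naive
    k-means on the kept embeddings and return the medoid indices."""
    n = H.shape[0]
    err = ((H - H_rec) ** 2).sum(axis=1)          # per-response L2 error
    m = int(np.ceil(rho * n))
    keep = np.argsort(err)[:m]                    # least-error responses survive
    R = H[keep]
    rng = np.random.default_rng(seed)
    k = min(k, m)
    cent = R[rng.choice(m, size=k, replace=False)]
    for _ in range(iters):                        # a few Lloyd iterations
        d = ((R[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        cent = np.stack([R[assign == j].mean(axis=0) if (assign == j).any()
                         else cent[j] for j in range(k)])
    d = ((R[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
    medoids = keep[d.argmin(axis=0)]              # kept response nearest each centroid
    return keep, medoids
```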
Self-Supervised Response Abstractor (SSRA) One main challenge of training the response abstractor is the lack of ground-truth summary labels. Inspired by previous works (Wang and Wan, 2021; Elsahar et al., 2021), we finetune our SSRA $\theta_{abs}$ in a self-supervised setting. Previous works often adopt a Leave-One-Out (LOO) setting where each response in a conversation thread takes a turn as the pseudo summary. This approach follows the assumption that responses in the same thread focus on the same event, so each high-relevance response can approximate the summary of the whole thread. However, such a setting suffers from a large portion of inappropriate response-summary pairs when the responses cover various aspects. As a result, we create pseudo summaries from the clustering results obtained from DRE. Specifically, each medoid $h^{ext}_j$ is taken as the pseudo summary, and the remaining responses of cluster $C_j$ are concatenated as the inputs for producing the summary, i.e.,

$$h^{abs}_j = \theta_{abs}\Big(\big\Vert_{h_i \in C_j \setminus \{h^{ext}_j\}}\, h_i\Big),$$

where $h^{abs}_j$ is the summary embedding of cluster $C_j$, and $H^{abs} = h^{abs}_1 \,\|\, \cdots \,\|\, h^{abs}_k$ represents the embedding of the abstractive summaries. Note that all responses in a cluster are treated as inputs during inference. Besides, each abstractive summary is attached to the source post $x_0$ to maintain the tree structure, where the new adjacency matrix is denoted by $A'$. Finally, both the extractive and abstractive results are fed, along with $x_0$, into the detector for rumor detection.
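The cluster-summary training pairs described above can be assembled as in this sketch: the medoid of each cluster is the pseudo summary target, and the remaining cluster members, concatenated, form the abstractor input. Function and variable names are illustrative, not from the paper.

```python
def make_training_pairs(clusters, medoids):
    """clusters: list of lists of response strings, one list per cluster C_j.
    medoids: the medoid response of each cluster (the pseudo summary).
    Returns (input_text, pseudo_summary) pairs for finetuning the abstractor."""
    pairs = []
    for members, medoid in zip(clusters, medoids):
        source = " ".join(r for r in members if r != medoid)
        pairs.append((source, medoid))
    return pairs

pairs = make_training_pairs(
    clusters=[["fake!", "this is fake", "totally fake"],
              ["I believe it", "seems true"]],
    medoids=["this is fake", "seems true"],
)
```

At inference time, by contrast, all responses of a cluster (medoid included) would be joined into the input, matching the note above that no response is held out.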

Training Objectives
We train the rumor detector and the adversarial response generator in two stages. The trainable parameters for the detector are the encoder layers $\theta_{enc}$ and the GCN layers $\theta_{gcn}$, while the decoder layers $\theta_{dec}$ are trained only for ARG. In the first stage, the generator is trained with the detector to improve both the detection results and its generation quality. The objectives of the generator are the cross entropy for rumor classification $\mathcal{L}_{CE}$ and the cross entropy for text generation $\mathcal{L}_{txt} = -\sum_{i=1}^{|x|} x_i \log \hat{x}_i$; the first stage thus minimizes the sum $\mathcal{L}_{CE} + \mathcal{L}_{txt}$. In the second stage, we train the generator with a fixed detector. The goal of the generator is to produce an adversarial response that degrades the detector's performance while resembling a human writing style. Thus, it is optimized to maximize the cross entropy for rumor detection while minimizing $\mathcal{L}_{txt}$.
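The two-stage objectives can be sketched as below. The sign convention (minimising $\mathcal{L}_{CE} + \mathcal{L}_{txt}$ in stage one, and $-\mathcal{L}_{CE} + \mathcal{L}_{txt}$ in the frozen-detector stage) is our reading of the description above, not the paper's exact equations.

```python
import math

def text_loss(token_probs):
    """L_txt: negative log-likelihood of the gold tokens, given the model's
    probability assigned to each gold token."""
    return -sum(math.log(p) for p in token_probs)

def generator_loss(l_ce, l_txt, stage):
    """Stage 1: train the generator jointly with the detector (minimise both
    terms). Stage 2: detector frozen; the generator maximises the detection
    loss while staying fluent, i.e. minimises -L_CE + L_txt."""
    return l_ce + l_txt if stage == 1 else -l_ce + l_txt
```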
$$\min_{\theta_{dec}} \; -\mathcal{L}_{CE} + \mathcal{L}_{txt} \tag{10}$$

For the DAS framework, the trainable parameters are the response extractor $\phi_{ext}$ and the response abstractor $\phi_{abs}$. The response extractor is trained to reconstruct the embedding of normal responses by minimizing the $L_2$ loss between the original embedding $h_i$ and the reconstructed one $\hat{h}_i$:

$$\mathcal{L}_{ext} = \sum_{i} \big\| h_i - \hat{h}_i \big\|_2^2.$$

The response abstractor is optimized to minimize the cross entropy between the generated summary $s^{abs}_j$ and the pseudo summary $s^{ext}_j$, i.e.,

$$\mathcal{L}_{abs} = -\sum_{t} s^{ext}_{j,t} \log s^{abs}_{j,t}.$$

Experimental Results

Experimental Setup
Datasets We evaluate our model on three real-world public datasets. The Twitter15 and Twitter16 datasets (Ma et al., 2017) contain 1490 and 818 Twitter posts, respectively, labeled with Non-Rumor (N), True Rumor (T), False Rumor (F), and Unverified Rumor (U). Moreover, the RumorEval2019 (RE2019) dataset (Gorrell et al., 2019) was released by the SemEval workshop in 2019 and contains 446 posts from both Twitter and Reddit. It provides three veracity labels, i.e., True Rumor (T), False Rumor (F), and Unverified Rumor (U). Detailed dataset statistics are listed in the Appendix.
Evaluation Metrics For the generation quality of the response abstractor, we calculate the perplexity (PPL) with GPT-2 (Radford et al., 2019) and the factual consistency with FactCC (Kryscinski et al., 2020). For rumor detection, we report the accuracy (Acc.) over all classes, the F1 score of each class, and the macro-averaged F1 (mF1).
Baselines We compare the rumor detection performance with several baselines. RvNN (Ma et al., 2018) captures the propagation patterns of each conversation thread.

Adversarial Attack and Defense
We first discuss how the proposed DAS framework reduces detector vulnerability. Table 1 demonstrates the results of BiTGN with the BART encoder, under attack by ARG while equipped with different response summarizers. Apart from the accuracy and mF1, we also calculate the Attack Success Rate (ASR) of ARG, defined as the ratio of successfully misled predictions among all initially correct predictions. The first and second rows represent the performance before and after the attacks.
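The ASR defined above can be computed as in this small sketch (the function and argument names are ours):

```python
def attack_success_rate(pred_before, pred_after, gold):
    """ASR: among samples the detector initially classified correctly, the
    fraction whose prediction becomes wrong after the attack."""
    correct = [i for i, (p, g) in enumerate(zip(pred_before, gold)) if p == g]
    if not correct:
        return 0.0
    flipped = sum(1 for i in correct if pred_after[i] != gold[i])
    return flipped / len(correct)
```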
The results show that ARG indeed degrades the detection performance, since the attack success rates exceed 60% on all datasets. Next, we compare the defensive ability of our response summarizers (both DRE and DAS) with different extract ratios $\rho$, with the number of clusters $k$ set to 3. First, the model performance is recovered significantly by simply equipping DRE during inference, indicating that the extractor can filter out a large portion of the attack responses generated by ARG without retraining. Similarly, DAS can also defend against the attacks while additionally providing prediction explanations through abstractive response summaries. To further analyze the behavior of DAS, Fig. 3 shows the model robustness under different extract ratios $\rho$ and numbers of clusters $k$ on Twitter15. First, the macro-F1 increases as the extract ratio decreases and saturates when the extract ratio $\rho$ is around 0.2. Besides, even with a high extract ratio, i.e., the left part of both figures, the model can still defend against a certain ratio of attacks while preserving more information from the responses. Second, increasing the number of clusters to produce more diverse summaries only slightly affects the defense ability of DAS, which demonstrates the robustness of our model.

Interpret Predictions with Summary
Automatic Evaluation Here, we show that the generated response summaries can be used to explain model predictions. Since higher text quality helps humans understand the models' decisions, we first compare the generation quality of abstractors trained under different self-supervised settings, including Leave-One-Out (LOO) and k-means with varying values of $k$. We initialize all models with BART-base-SAMSum, a summarizer pre-trained on the SAMSum corpus, and take its zero-shot results as a baseline for text quality. The results in Table 2 show that the self-supervised models have higher perplexity than BART-base-SAMSum due to the abbreviations and informal expressions in social microblog text, such as hashtags and URLs. Compared to SSRA-LOO, our proposed SSRA-k-means achieves better perplexity scores, indicating its ability to generate more fluent and human-understandable summaries. The inclusion of fragmented responses and incomplete sentences in the SSRA-LOO training targets contributes to its higher perplexity. Second, we validate the factual consistency between the input responses and the generated summaries. We observe that SSRA-k-means outperforms the baselines across different $k$ values, suggesting the necessity of self-supervised learning. Besides, even when $k = 1$, i.e., SSRA-k-means provides only one summary as SSRA-LOO does, our model scores higher than SSRA-LOO on all datasets by covering more factual information from the responses. Furthermore, the factuality improves significantly when $k > 1$, demonstrating the effectiveness of providing the abstractor with responses from different perspectives.
Human Evaluation and Case Study We recruit 100 human readers and conduct a two-part user study. In part A, we randomly select 10 samples from each of the three datasets, each containing a source post, responses, and two sets of summaries generated by SSRA-LOO and SSRA-k-means (k = 3). Readers are asked to assess the informativeness of the summaries based on viewpoint coverage and diversity using a Likert scale from 1 to 5, with 5 representing the most informative. In part B, we aim to evaluate whether humans make consistent judgments after reading either the responses or the response summaries. Thus, we select 20 samples, including 10 true and 10 false rumors, and ask the participants to judge the truthfulness of the source posts based on the provided information, i.e., responses or summaries. The upper part of Fig. 4 shows that SSRA-k-means outperforms SSRA-LOO in terms of informativeness, indicating the effectiveness of utilizing k-means clustering to grasp diverse opinions. The results of part B show that participants who read only the summaries achieve accuracy comparable, on average, to those who read the responses, with a marginal difference of approximately 5%. Although the responses provide more complete information, these findings suggest that summaries are practical for social media users to judge post veracity, as the summaries effectively capture different viewpoints in a shorter format. We notice that the model without the top-down GCN (-TDGCN) improves on the non-rumor and true-rumor classes of Twitter15. This may be caused by the diverse structure of these data, as observed by Huang et al. (2020). Although the structural information may be noisy, the model still benefits from introducing the propagation path through GCN layers compared with the model without GCN layers (-GCN).

Conclusion
In this paper, we propose a novel response summarization framework, Defend-And-Summarize (DAS), to enhance the robustness and interpretability of rumor detection models.

Limitations
Our work focuses on determining the truthfulness of a source post from social media websites by analyzing the structural information of its responses. Since the opinions of various users provide rich information and can significantly influence other readers, we do not consider the setting of fake news detection that relies solely on the news content. Moreover, the DRE component in our framework adopts the widely used k-means algorithm to produce response clusters without considering specific aspects. However, it would be beneficial to create clusters with more fine-grained aspects, such as the stance or sentiment of responses, which would enable humans to gain a more comprehensive understanding of public opinion. We will explore this possibility as a direction for future work.

Ethics Statement
We discuss some potential risks that our rumor detection system might raise. As our proposed framework relies heavily on the interactions between different users on social media, the content of users' utterances, including mentions of other users, is revealed to the system. However, the system does not require any personal information such as user descriptions, user account age, number of followers, or number of posts. For this reason, the proposed method should not infringe on individuals' privacy. Another risk is that the detector might still give wrong classification results that mislead users. For this issue, we believe that our method provides a simplified but comprehensive summary of the diverse responses under each post, enabling users to observe opinions from more sides of an event. This may enhance the public's ability to rethink and assess the truthfulness of various sources of information.

A Adversarial Response Generator
Here, we provide a more detailed formulation of the adversarial response generator. To simulate the attack responses produced by different users in real-world scenarios, we adopt the adversarial response generator (ARG) proposed by Song et al. (2021), which is trained by adversarial learning under white-box settings. We initialize ARG with a BART model due to its outstanding performance on several text generation tasks. Given a conversation thread $\{x_i\}_{i=0}^{n-1}$, the goal is to generate an adversarial response $x^*_n$ that makes the detector deviate from the ground truth $y$ by maximizing the detection loss $\mathcal{L}_{det}$, detailed in Sec. 3.5, i.e.,

$$\max_{\theta_{dec}} \; \mathcal{L}_{det}\big(h^*_n, A'\big),$$

where $h^*_n$ denotes the hidden representation of $x^*_n$ and $A'$ is the new adjacency matrix that attaches $x^*_n$ to the source post $x_0$. To generate a response, we construct the ARG by sharing the encoder $\theta_{enc}$ with BiTGN and feeding the hidden representation of the last encoder layer $H^{(L_e)}$ to the decoder $\theta_{dec}$:

$$h^*_n = \theta_{out}\big(\theta_{dec}(H^{(L_e)})\big).$$

Note that $\theta_{out}$ denotes the output layer, which is tied with the input embedding layer $\theta_{in}$. In this way, $h^*_n$ can approximate the embedding of a generated response and be concatenated with the embeddings of $\{x_i\}_{i=0}^{n-1}$ to serve as part of the encoder's inputs without taking an argmax operation. Subsequently, the gradients can be backpropagated from the rumor detection loss to train the ARG. Moreover, the generated response is attached to the source post of the thread, creating a new edge $e_{0,n}$ between $x_0$ and $x_n$.

B.1 Datasets
All datasets we used are publicly available. Table 5 displays the statistics of the RumorEval2019 (RE2019), Twitter15, and Twitter16 datasets. N, T, F, and U represent Non-Rumor, True Rumor, False Rumor, and Unverified Rumor, respectively. We also calculate the number of posts for each claim and report it as the thread length in the table.

B.2 Implementation Details
All of our experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. We conduct 5-fold cross-validation and report the average results for all datasets. The total training time required for each fold on RE2019 and Twitter16 is around one hour, and two hours for Twitter15. The number of trainable parameters is around 300 million. We use the same set of hyperparameters on all datasets. Specifically, the batch size for BiTGN, ARG, and SSRA is 16, while the learning rates are set to $2 \times 10^{-5}$. For DRE, the batch size and learning rate are 256 and $4 \times 10^{-5}$, respectively. We finetune BiTGN, ARG, and SSRA for 10 epochs, and DRE is trained for 50 epochs. The number of GCN layers $L_g$ in BiTGN is 2, and the encoder and decoder of DRE each consist of 4 transformer layers. The dimension of the hidden noise $z$ in DRE is set to 100. Moreover, we implement our framework with Hugging Face Transformers and PyTorch. For the rumor detection results in Section 4.4, the transformer encoder of BiTGN is initialized with RoBERTa-base. For the adversarial attack and defense results in Section 4.2, the ARG shares the same encoder with BiTGN, and the overall encoder-decoder framework is initialized with BART-base. The SSRA is initialized with BART-base-SAMSum. We also follow Song et al. (2021) to perform tree decomposition on the original datasets, where each conversation thread is decomposed into several subtrees by adding each response one by one in chronological order. In this way, we not only increase the amount of training data for BiTGN but also create a pseudo ground-truth response for ARG from the last response of each subtree. For the baseline models of rumor detection, we use the official implementations of RvNN, BiGCN, EBGCN, and DUCK¬UT. For WETGN, we implement the model architecture ourselves due to its similar design. For the evaluation of factual consistency, we use the official implementation of FactCC.
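The tree decomposition described above (adding responses one at a time in chronological order, with the last response of each subtree serving as ARG's pseudo ground truth) can be sketched as:

```python
def decompose_thread(posts):
    """posts: [x0, x1, ..., xn] in chronological order.
    Returns the subtrees [x0, x1], [x0, x1, x2], ...; the final element of
    each subtree is the pseudo ground-truth response for ARG."""
    return [posts[: i + 1] for i in range(1, len(posts))]

subtrees = decompose_thread(["x0", "x1", "x2", "x3"])
```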

C Additional Results of Rumor Detection C.1 Models with Different Backbones
We provide the results of WETGN and BiTGN using different transformer encoders, including RoBERTa with 12 self-attention layers and the BART encoder with 6 self-attention layers. The results are shown in Table 8. Both WETGN and BiTGN perform better with the RoBERTa encoder, which is expected since RoBERTa contains more layers than the BART encoder. Moreover, RoBERTa is pre-trained on several text classification tasks, while BART is more effective on text generation tasks. We also report the rumor detection results of BiTGN after each adversarial training stage. In the first stage (BiTGN†), the inputs of the detector contain the response generated by ARG, while the second stage (BiTGN*) contains only the responses from the original data. We observe that the model performs better in the first stage, which indicates that the generated responses in the first stage help the detector improve its accuracy.

C.2 Detailed Detection Results
Table 6 provides the detailed rumor detection results, including the F1 score for each class. The results demonstrate that BiTGN outperforms all baselines listed in the first block, demonstrating the power contributed by the robust textual representations from the transformer network and the effective graph aggregation of the BiGCN component. Moreover, as discussed in Section 4.4, BiTGN has a lower F1 score on the non-rumor and true-rumor classes of Twitter15, which may be caused by the diverse structural information of this dataset, as observed by Huang et al. (2020). To validate the significance of the detection results, we perform a paired Student's t-test between the proposed BiTGN and each of the baseline models. The corresponding p-values for both accuracy and macro-averaged F1 scores are presented in Table 7. Notably, our BiTGN not only achieves the best average accuracy and F1 scores, as discussed previously, but also significantly outperforms models utilizing TF-IDF vectors as node features (BiGCN, EBGCN, RvNN). We notice that the significance levels for the comparisons with WETGN and DUCK¬UT are relatively lower. This could be attributed to their shared use of a transformer backbone and the notable variability across folds. Nonetheless, compared to WETGN, which incorporates a top-down GCN with weighted edges, our BiTGN effectively benefits from the bi-directional GCN component even in the absence of weighted edges. Compared with DUCK¬UT, which employs two distinct transformer branches to independently model each conversation thread as both a stream and a graph, our model achieves a more favorable average performance while utilizing only a single transformer branch, thereby requiring fewer parameters.

D Adversarial Attack and Defense on Other Datasets
We investigate the robustness of DAS under different extract ratios ρ and numbers of clusters k on the RE2019 and Twitter16 datasets, as illustrated in Fig. 5. The dashed lines represent the detector's performance without the DAS framework. Specifically, the red and green lines stand for the performance of the model with and without being attacked, respectively. The results on all datasets are intuitive and similar: the F1 score increases as the extract ratio ρ decreases, and the Attack Success Rate (ASR) behaves in the opposite trend. Moreover, increasing the number of clusters k to provide responses from more perspectives does not affect the model performance drastically.
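For reference, the Attack Success Rate can be computed as the fraction of originally correct predictions that the attack manages to flip. This is a minimal sketch of one common formulation; the exact bookkeeping in our evaluation scripts may differ:

```python
def attack_success_rate(y_true, pred_clean, pred_attacked):
    """ASR: among samples the detector classified correctly before the
    attack, the fraction it misclassifies after the attack."""
    correct = [i for i, (t, p) in enumerate(zip(y_true, pred_clean)) if t == p]
    if not correct:
        return 0.0
    flipped = sum(1 for i in correct if pred_attacked[i] != y_true[i])
    return flipped / len(correct)

# toy illustration: 3 of 4 samples correct before the attack, 1 flipped
asr = attack_success_rate([0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0])
```

Under this definition a strong defense keeps the ASR low even when the F1 score on clean inputs is unchanged.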

E Additional Results of DAS Framework

E.1 Rumor Detection with DAS
To observe how the proposed response summarization framework affects the performance of rumor detection, we provide the detection results of BiTGN equipped with the variants of DAS, without being attacked, in Table 9. DRE† denotes DRE with the autoencoder (AE) only. In this experiment, BiTGN is initialized with BART-base, and the number of clusters k is set to 3, i.e., DRE extracts 3 responses, and DAS further produces 3 abstractive summaries based on the 3 clusters. Firstly, we observe that both DRE† and DRE can approximate the model's performance. Notably, DRE even scores higher on the Twitter15 and Twitter16 datasets, potentially due to the filtering process, which identifies and removes noisy responses that could degrade the model's performance. Moreover, DRE achieves comparable results even if only 3 responses are selected, indicating that the extractor can effectively capture the representative responses from each conversation thread. As such, the extracted responses can be used to interpret the model's behavior. Next, for all summarizers, the performance rises as the extract ratio ρ increases and saturates at different ratios on different datasets, which suggests that representative responses could be excluded if we filter out too many of them. Lastly, we also find that DAS slightly degrades the detection performance, possibly due to the distribution shift of abstractive summaries, as the abstractor is tuned under self-supervised settings without fine-grained ground-truth labels. Furthermore, the limited quantity of data also indicates that there is still room for improvement. We consider enhancing the text quality under self-supervised settings as a potential avenue for future research.
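The extraction step can be sketched as follows: cluster the response embeddings into k groups and keep the response closest to each centroid. This is a minimal pure-Python sketch with a deterministic first-k initialization; the actual DRE operates on learned autoencoder representations and includes a filtering stage, which are omitted here:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=50):
    """Plain k-means with a deterministic init (first k points);
    k-means++ would be more robust in practice."""
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign

def extract_representatives(vectors, k):
    """Return one index per cluster: the vector closest to its centroid."""
    centroids, assign = kmeans(vectors, k)
    reps = []
    for c in range(k):
        idxs = [i for i, a in enumerate(assign) if a == c]
        if idxs:
            reps.append(min(idxs, key=lambda i: dist2(vectors[i], centroids[c])))
    return sorted(reps)

# toy response embeddings: three well-separated pairs
emb = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0),
       (0.1, 0.0), (5.1, 5.0), (10.0, 0.1)]
reps = extract_representatives(emb, k=3)
```

Each selected index stands in for an entire cluster of similar responses, which is why only k extracted responses can approximate the full thread.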

E.2 Human Evaluation
In this section, we provide the settings and additional results of the human evaluation. We recruit 100 human readers for both parts of the human evaluation. We additionally insert a trick question in part A to validate the answer quality of each participant, and each participant is paid $5 if he/she passes the trick question. To verify the consistency of the collected results, we calculate Fleiss' Kappa to evaluate the inter-rater reliability of the human evaluation. For part A, we calculate Fleiss' Kappa to assess the readers' agreement toward rating our model more favorably than the baseline, and the score is 0.3321. In part B, for readers that make predictions based on the responses and the summary, we obtain Fleiss' Kappa scores of 0.5341 and 0.4225, respectively. Following the criteria outlined by Landis and Koch (1977), these values indicate fair and moderate agreement among participants in parts A and B, respectively. It is important to note that this level of agreement has been established as reliable in prior research (Cao and Wang, 2021; Chen et al., 2021). In part A, we select SSRA-LOO as the baseline and compare its informativeness with SSRA-k-means (k=3). Since SSRA-LOO generates only one summary by default, we randomly divide each conversation thread into 3 groups and make SSRA-LOO generate one summary for each group for a fair comparison. We ask the participants to rate each set of summaries based on the following scoring strategy:

• Score 1: None of the three summaries accurately capture the information from the responses, and the summaries may repeat the same information or contain unrelated information.
• Score 2: Only one summary captures the information from the responses, but the other ones may be incomplete, inaccurate, or repetitive.
• Score 3: Most of the summaries accurately capture the information from the responses, but some important perspectives may be missing, or some information may be repeated unnecessarily.
• Score 4: All three summaries accurately capture the information from the responses but may not fully cover all important perspectives or provide a nuanced understanding of the issue.
• Score 5: All three summaries accurately capture the information from the responses, and the summaries provide a comprehensive and nuanced understanding of the issue.The summaries cover diverse perspectives and avoid repetition.
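The agreement statistic reported above can be computed from a rating matrix whose rows are items, columns are categories, and cells count the raters who chose that category. A minimal sketch of Fleiss' Kappa:

```python
def fleiss_kappa(table):
    """Fleiss' Kappa for a rating matrix: rows are items, columns are
    categories, and each cell counts the raters choosing that category.
    Assumes the same number of raters for every item."""
    n_items = len(table)
    n_raters = sum(table[0])
    total = n_items * n_raters
    # proportion of all assignments falling into each category
    p_j = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    # observed agreement for each item
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in table]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

For example, perfect agreement on every item yields a kappa of 1, while ratings split evenly across categories yield a negative kappa.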
In part B, we select 20 samples from the RE2019 and Twitter15 datasets, with 5 true rumors and 5 false rumors from each of them. The participants are asked to determine whether a source post is true or false based on all its responses or a set of summaries.
A key idea is to observe whether the responses or summaries deny the rumor, but the readers are not required to accept all the utterances and can make their decisions based on their intuition after reading the provided information. We visualize the predictions based on the responses and the summary in Fig. 6. Notably, a higher number of individuals accurately predict the veracity (p_gt > 0.5) of 15 and 17 samples based on the responses and the summary, respectively. Despite the slightly lower average accuracy of the summary, as shown in Fig. 4, the results still indicate its effectiveness in providing social media users with essential information from the responses. Additionally, we observe a high correlation between the predictions obtained from the responses and the summaries in most cases. This emphasizes the interpretability of the summaries, as we can identify the crucial information that the detection models focus on when making predictions.

E.3 Generation Examples
We demonstrate more examples generated by DAS in Tables 10 and 11. QID corresponds to the Question ID of part B of the human evaluation, as shown in Fig. 6. We provide the source post, responses, and both extractive and abstractive summaries for each example. The summaries can not only explain the predictions of rumor detection models but also benefit social media readers by helping them quickly understand the public's opinions toward specific events.
Source Post (False Rumor): a claim that #obama used the #shutdown to scuttle the amber alert system reveals an ignorance about amber alerts. URL

Responses (Cluster 1)
[2]: "@name1 : Claim #Obama used #shutdown to scuttle AmberAlerts reveals ignorance abt AmberAlerts URL" @name2
[5]: RINO!!! RT @name1 A claim that #Obama used the #shutdown to scuttle the Amber Alert system reveals an ignorance about Amber Alerts.
[4]: @name1 @name4 That's not the claim. The claim is that Obama is trying to scare people by appearing to shut down sites.
[7]: @name1 @name5 I'm really sick and tired of seeing more of this fake news finding traction; how stupid are these readers?
[8]: @name1 @name6 Why care at all?? Maybe the little scamps wandered off to look for some food since their SNAP was cut. Just a guess.
[11]: @name7 I'm willing to bet there's a high probability that the people who believe this might also believe Obama is Kenyan
[12]: @name1 @name8 These lies about the #GOPShutdown are not isolated. They're part of RNC strategy of distraction URL

Extractive Summary
[1]: @name1 No it doesn't It shows how ignorant ppl are for believing stupid S#%t #SHUTDOWN
[2]: @name1 Just got asked to "inform" someone about this - since he/she can't be bothered to inform self. URL
[3]: @name1 @name4 Perhaps, but a smart administration wouldn't have put that notice on the site. I have no sympathy.

Abstractive Summary
[1]: Claim #Obama used #shutdown to scuttle Amber Alerts reveals ignorance abt AmberAlerts
[2]: I don't know how to inform someone about this - since he/she can't be bothered to inform self.
[3]: That's not the claim. The claim is that Obama is trying to scare people by appearing to shut down sites.

[18]: @name1 @name6 Krispy Kreme is headquartered in the American South, (Winston Salem, North Carolina) sooo...
[20]: @name1 @name7 The Kicker is that this was advertised to children as Krispy Kreme Klub, teaching racism and poor spelling all in one.

(Cluster 3)
[7]: @name1 Hi, we know we got it wrong & wholeheartedly apologise. We're taking steps to make sure it doesn't happen again
[8]: @name1 They're dull in Hull and the Isle of Mull is seething with discontent.
[10]: @name1 GG Hull, glad to know the 2017 city of culture is trying to be inclusive of all groups
[13]: @name1 Like the article said. It was a poor choice of the play on the word Club (*spelled Klub) Their intentions were good though.
[15]: @name5 @name1 you're going to take step to make sure you don't create any more sales events with unbelievably racist names? Okay.
[3]: I'm sorry to hear this, but it was a poor choice of the play on the word Club (*spelled Klub).
Table 10: Generated examples of DAS (k=3). QID corresponds to the Question ID in Fig. 6. The responses are arranged in different clusters and chronological order. Key information captured by the summaries is highlighted with different colors for each cluster. The responses within the same cluster deliver similar information, and the produced summaries can effectively capture essential information from the responses.
Abstractive Summary
[1]: NOOO!!!...

Table 11: Generated examples of DAS (k=3). QID corresponds to the Question ID in Fig. 6. The responses are arranged in different clusters and chronological order. Key information captured by the summaries is highlighted with different colors for each cluster. The responses within the same cluster deliver similar information, and the produced summaries can effectively capture essential information from the responses.

Figure 1: Three examples of the predicted probability of each class with respect to the responses on the Twitter15 dataset. The curves with their face colored represent the ground-truth labels of their source posts. Critical responses that result in prediction shifts larger than 0.5 are marked with a red circle.

Figure 2: Overview of our proposed framework (upper left). The rumor detector BiTGN (upper right) is trained to predict the veracity of each source post. The response summarizer DAS (lower) preemptively filters out attack responses generated by the response generator. It then organizes the remaining responses into k clusters and produces both extractive and abstractive summaries for each cluster accordingly. During the inference phase, the detector makes predictions based on the source post and the summaries.

Figure 3: Effect of the extract ratio ρ and the number of clusters k on Twitter15. The dashed lines represent the detection performance without DAS. The Macro-F1 (left) increases as the extract ratio ρ decreases, and the Attack Success Rate (right) behaves in the opposite trend. The number of clusters k does not influence the results significantly, which demonstrates the robustness of DAS.

Figure 4: Human evaluation of generated summaries. In part A, our SSRA-k-means model generates more informative response summaries compared to SSRA-LOO. In part B, human predictions based on either responses or summaries achieve comparable accuracy, which demonstrates the interpretability of the summaries.

Figure 5: Effect of the extract ratio ρ and the number of clusters k on RE2019 (left) and Twitter16 (right). The dashed lines represent the detection performance without DAS. The Macro-F1 increases as the extract ratio ρ decreases on both datasets, and the Attack Success Rate (ASR) behaves in the opposite trend. Moreover, the number of clusters k does not influence the results significantly, demonstrating the robustness of DAS.

Figure 6: Visualization of the number of predictions based on the responses / summary for each sample of human evaluation part B. The ground-truth label for each sample is marked with "+". In most cases, predictions based on the responses and the summaries are highly correlated, demonstrating the interpretability of the summaries.

Abstractive Summary
[1]: RT: NEW. Leaked phone call between rebel leader & Russian intel agent: "Cossacks" shot down #MH17
[2]: This is a hoax, it was a hoax of sorts and you promoted it. Shame on you
[3]: what is the proof??

16 Source Post (True Rumor): microsoft is reportedly buying 'minecraft' developer mojang for $2 billion URL
[2]: Microsoft is reportedly buying 'Minecraft' developer Mojang for $2 billion
[3]: I think Minecraft is a great game.

Table 1: Overall results of adversarial attack & defense on BiTGN. Diff. ASR represents the difference of ASR with and without defense. Both DRE and DAS can successfully resist a large amount of attacks from ARG.

Table 2: Automatic evaluation of generated summaries (PPL ↓ and FactCC per dataset). The best / second best scores are marked in bold / underlined. Our SSRA-k-means models generate summaries with better text quality and factual consistency.

RvNN (Ma et al., 2018) uses tree-structured recursive neural networks with GRU units. BiGCN (Bian et al., 2020) represents each post with TF-IDF vectors and utilizes a bi-directional GCN to aggregate both propagation and dispersion structures.

Table 4: Overall results of rumor detection. The best / second best scores are marked in bold / underlined. Our BiTGN outperforms all baselines in the first block.

In real applications, we could provide both responses and summaries to convey the essential viewpoints through the summaries and let users delve into details in the responses. Moreover, we evaluate the percentage of ground-truth predictions p_gt for each sample and analyze the correlation between the p_gt of responses and summaries. We observe a Pearson correlation of 0.54 with p-value 0.014, justifying a high correlation between predictions based on responses and summaries. This demonstrates that the response summaries effectively capture crucial information and can interpret human decisions based on the responses.

Table 3 demonstrates a generation example of SSRA-k-means (k = 3). The source post contains a false claim about a "Malaysian flight shot down by Cossacks". Our summaries encompass diverse stances such as "It was a hoax" (deny) and "what is the proof" (query), providing evidential guidance for models and users to evaluate the veracity of the source post. These summaries also help identify which information models focus on.

In this section, we analyze the rumor detection results of the proposed BiTGN with RoBERTa encoder in Table 4. Compared with all baselines, our BiTGN achieves the best accuracy and macro-averaged F1 on all datasets. Specifically, transformer-based models (BiTGN, DUCK¬UT, WETGN) outperform models that use TF-IDF vectors (BiGCN, EBGCN, RvNN) as node features, which demonstrates the importance of robust textual representations. Moreover, our model performs the best among transformer-based baselines, showing that the BiGCN component can better aggregate the conversation information from two directions. Besides, compared to DUCK¬UT, which models the conversation structures with two branches of transformer networks, our model still achieves better results with fewer parameters. We also analyze the influence of the GCN by ablating the BiGCN, as shown in Table 4. In particular, the model with BiGCN (BiTGN) achieves the best performance on RE2019 and Twitter16 while achieving the second best on Twitter15.
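The correlation reported above is the standard sample Pearson coefficient; a minimal sketch (the actual p_gt values come from the human study and are not shown here, and the p-value would additionally require a significance test such as `scipy.stats.pearsonr`):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A value near 1 means readers who predict a sample correctly from the responses also tend to predict it correctly from the summaries.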

Table 7: Paired Student's t-test between our BiTGN and each baseline for rumor detection. The p-values of both accuracy (Acc.) and macro-averaged F1 (mF1) are presented.

Table 8: Performance of BiTGN and WETGN with BART and RoBERTa as the transformer encoder. The RoBERTa encoder improves the performance of both models.