Adversary-Aware Rumor Detection

While social media has become a primary source of news, it has also become more challenging for people to distinguish rumors from non-rumors, which invites malicious manipulation and may lead to public-health harm or economic loss. Consequently, many rumor detection models have been proposed to automatically detect rumors based on content and propagation paths. However, most previous works are not aware of malicious attacks, e.g., framing. Therefore, we propose a novel rumor detection framework, Adversary-Aware Rumor Detection, including a Weighted-Edge Transformer-Graph Network and a Position-aware Adversarial Response Generator, to reduce the vulnerability of detection models. To the best of our knowledge, this is the first work that can generate adversarial responses while considering the response position. Experimental results show that our model achieves the state of the art on various rumor detection tasks with the proposed Weighted-Edge Transformer-Graph Network, and maintains its performance under adversarial response attacks after adversarial learning with the Position-aware Adversarial Response Generator.[1]


Introduction
With the popularity and accessibility of social media, social media has become the primary source for obtaining information.[2] Compared with traditional news, posts on social media are usually shorter and spread faster, which also increases the difficulty of message verification. As such, social media are increasingly targeted for manipulation, leading to tremendous economic losses and even deaths. Take the COVID-19 pandemic as an example: a recently published study shows that some 800 people died because of a rumor that drinking highly-concentrated alcohol can disinfect the body (Islam et al., 2020). Therefore, fighting against misinformation in social networks has gained a great deal of attention and has become essential and inevitable.

Figure 1: Position test for an adversarial response. The reply position can influence the detection model.

[1] The code is released as a public download at https://github.com/yunzhusong/AARD.
[2] https://pewrsr.ch/3nzYpQd
In this paper, we study the problem of rumor detection on social media, where a rumor is defined as an unverified and instrumentally relevant information statement in circulation (DiFonzo and Bordia, 2007; Zubiaga et al., 2018). Given a conversation thread, including a source post and related responses, the rumor detection task aims to determine whether the source post is a rumor or not. Previous works on rumor detection can be categorized into three classes according to their data usage. Content-based approaches only use the textual information of the source posts and the user responses (Ma et al., 2016, 2019), while graph-based methods (Ma et al., 2017, 2018; Bian et al., 2020) consider the message propagation paths or model the propagation paths as a tree. In addition to considering the textual information and propagation paths, user-based methods take user profiles into consideration (Giudice, 2010; Liu et al., 2015).
However, two challenges of detecting rumors have not been completely addressed. 1) Robustness to different responses. Previous works on rumor detection take all the responses in the conversation thread into consideration and extract important information in a data-driven manner. However, not all responses help detect rumors, especially malicious framing responses, i.e., responses that promote a particular misleading interpretation. As such, it is necessary to provide a learning mechanism that enables the selection of important responses.
2) Vulnerability to malicious attacks. Most existing methods rely only on the datasets, which may leave them vulnerable to adversarial attacks, e.g., attacks by Twitter bots. Ma et al. (2019) make the first attempt to utilize a GAN-based approach to produce adversarial text. Nevertheless, it does not consider the graph structure of the conversation thread, i.e., the generator cannot determine which responses it should reply to. In fact, the reply position can influence the detection model. As shown in Figure 1, the predicted probability that the source post is a rumor ranges from 30% to 70% depending on the position of the attached adversarial response. It is challenging to generate an adversarial attack that simultaneously considers both structural and textual information, since gradient-based methods cannot be directly applied due to the discrete nature of text and structure.
To address these two challenges, we propose a novel framework, namely, Adversary-Aware Rumor Detection (AARD), which includes i) a Weighted-Edge Transformer-Graph Network (WETGN) and ii) a Position-aware Adversarial Response Generator (PARG). Specifically, given a source post, responses, and the propagation structure as input, we use a transformer-based encoder to encode each token in the whole conversation thread to exploit existing pre-trained knowledge. Each token can jointly attend to different tokens regardless of token position, which gives the model the flexibility to break the distance limit in the sequence. Since the transformer layer only takes the responses in the conversation thread as a sequential input, the propagation path is not considered. Therefore, a Graph Convolutional Network (GCN) is applied to embed the structure by taking the token embeddings as node features and aggregating the features according to the propagation paths. Inspired by Veličković et al. (2018), we construct the edge features from the incident nodes and build an edge filter before the GCN layers to address the first challenge. As such, we can leverage the advantages of both transformers and graph neural networks.
Moreover, to address the second challenge, we build a Position-aware Adversarial Response Generator (PARG) to train the detector by adding an adversarial response to the conversation thread. Specifically, based on a transformer-based encoder-decoder framework, PARG takes the source post with part of the corresponding responses as input to generate an adversarial response. Nevertheless, choosing the attachment position is also crucial when attacking a structure-aware detection model. PARG is trained to select the position by considering the correlation between the generated response and each of the existing posts. However, the position selection involves the argmax function, which is a non-differentiable operation. Therefore, to enable the backpropagation of gradients from the detector, PARG instead predicts the probabilities of attaching the generated response to each existing response. When updating the edge weights of the attached edges in the detection model, the generator can use the gradients to correct the predicted probabilities.
By fine-tuning WETGN with the adversarial data generated by PARG, WETGN is equipped with a certain degree of resistance to attacks while maintaining its performance on clean datasets. Nevertheless, although an attacker can generate adversarial examples against the detection model, it may create nonsensical sentences, which can be manually excluded (noticeability). On the other hand, imposing constraints on the generated examples decreases the possibility of finding effective adversarial examples (success rate). This paper designs a training pipeline to strike a balance between success rate and noticeability: the attacker is trained to decrease the detection performance and approach the real responses simultaneously.
Extensive experimental results show that the proposed WETGN outperforms state-of-the-art approaches on three rumor detection benchmarks by at least 4.9%, 2.89%, and 3.87% on the Twitter15, Twitter16, and Pheme datasets, respectively. At the same time, AARD can resist the adversarial attack. Moreover, the proposed PARG can successfully attack existing detection models with a success rate of at least 25.08%. The success rate is significantly reduced after fine-tuning, which shows the compatibility and usefulness of PARG.

Related Work
Early works rely on textual content to verify the authenticity of social media posts. For example, Badaskar et al. (2008) quantify the frequency of uncommon phrases in articles and perform syntactic and semantic checking, while Potthast et al. (2018) detect the truthfulness of news by analyzing its writing style. Ma et al. (2016) use recurrent neural networks to learn both the temporal and textual representations of source posts and user responses, which substantially improves over prior methods that rely on hand-crafted features. Also, Volkova et al. (2017) extract text features with LSTM and CNN structures to make the prediction.
On the other hand, a recent line of studies focuses on automatically detecting rumors based on the tree structure of the conversation thread (Ma et al., 2017; Kumar and Carley, 2019; Lu and Li, 2020). For instance, Ma et al. (2018) build a tree-structured recursive neural network to capture hidden features from either the top-down or the bottom-up propagation structure and text content. However, it can only obtain the information of one propagation structure and ignores the other. To solve this problem, Bian et al. (2020) use a GCN-based model to embed both propagation and dispersion structures, enabling the proposed method to process graph/tree structures and learn higher-level representations more conducive to rumor detection. Besides, by utilizing the hierarchical structure in the conversation thread (i.e., parent, child, before, after, and self), Khoo et al. (2020) adopt the idea of Shaw et al. (2018) to perform structure-aware self-attention.
In addition, stance and user information are also used in several studies. By using stance prediction as an auxiliary task with multi-task learning, Li et al. (2019) and Kumar and Carley (2019) have demonstrated that stance prediction plays a vital role in rumor detection. Furthermore, Li et al. (2019) incorporate collected user credibility to supervise the detection model. Lu and Li (2020) construct the propagation network from users' retweet sequences together with user profiles to capture the correlation between user propagation and the source post. The uniqueness of our work lies in reducing the vulnerability of detection models.
Due to small or non-diversified training data, a recent line of studies utilizes adversarial learning (Ma et al., 2019; Yang et al., 2020) or data augmentation to improve the detectors. For example, Ma et al. (2019) propose an RNN-based GAN model, where the generator aims to generate conflicting information in the conversation thread and the discriminator is forced to learn more robust features. On the other hand, Han et al. (2019) augment data by using semantic relatedness to assign pseudo labels to unlabeled tweets. However, structural information is important but not considered in these previous works.

Problem Formulation
Given a conversation thread comprised of a source post and the corresponding responses, rumor detection aims to determine whether the claim of the source post is a rumor or not. Let X = {x_0, x_1, ..., x_i, ..., x_N} denote a conversation thread, where x_0 represents the source post and {x_i}_{i=1}^{N} represents the N responses. A graph G = (V, E) is constructed by taking each element in X as a node and the interactions between elements as the edge connections, forming the node set V and the edge set E, respectively. For example, if nodes x_u and x_v have a direct interaction (e.g., commenting or retweeting) in the same conversation thread, an edge (x_u, x_v) ∈ E is constructed accordingly. Due to the nature of social media, the graph G is an acyclic tree. Let y ∈ {rumor, non-rumor} be the class label. Rumor detection aims to predict y given the graph G.
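As a concrete illustration of this formulation (our own sketch, not the released code), a conversation thread and its reply tree can be represented with an adjacency list, where node 0 is the source post and every reply adds one edge:

```python
# Hypothetical illustration: a conversation thread X = {x_0, ..., x_N}
# and its reply tree G = (V, E).
def build_thread_graph(posts, reply_to):
    """posts[0] is the source post; reply_to[i] is the index of the post
    that response i replies to (reply_to[0] is None for the root)."""
    nodes = list(range(len(posts)))
    edges = [(reply_to[i], i) for i in nodes if reply_to[i] is not None]
    return nodes, edges

posts = ["source claim", "reply A", "reply B", "reply to A"]
reply_to = [None, 0, 0, 1]   # x_1, x_2 reply to x_0; x_3 replies to x_1
V, E = build_thread_graph(posts, reply_to)
# E == [(0, 1), (0, 2), (1, 3)] -- an acyclic tree rooted at x_0
```

With N responses the tree always has exactly N edges, matching the acyclic-tree property stated above.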

Rumor Detection Model
Transformer Encoder: To obtain the representation of text contents, we adopt a transformer-based encoder to exploit pre-trained knowledge. We first flatten the tree-structured graph in chronological order, which constitutes a source post followed by a sequence of responses. Specifically, the source post and each response are started with a special token [CLS] and ended with another special token [SEP] to indicate the separation of nodes (Liu and Lapata, 2019). In this setting, we allow each token to jointly attend to nodes in different positions to better capture semantics. Let h_i^(0) ∈ R^{|x_i|×d} denote the d-dimensional embedding of a node x_i, which is constructed from token embedding, segment embedding, and position embedding (Devlin et al., 2019). The embedding of a conversation thread H^(0) can thus be obtained as follows:

H^(0) = h_0^(0) || h_1^(0) || ... || h_N^(0) ∈ R^{M×d},

where || is the concatenation operation and M = |x_0| + |x_1| + ... + |x_N| is the length of the input sequence. The embedding is passed through several transformer layers. At layer l+1, the features from the previous layer H^(l) are transformed by three linear layers to form the query Q, key K, and value V matrices, and the output H^(l+1) is computed as follows:

Q = H^(l) W_q,  K = H^(l) W_k,  V = H^(l) W_v,
H^(l+1) = softmax(Q K^T / √d_k) V,

where W_q, W_k, and W_v are trainable parameters and √d_k is the scaling factor to prevent small gradients. The feature of the [CLS] token from the last layer is taken to represent each node; we denote the resulting node features as Z ∈ R^{(N+1)×d}.

Graph Network: The interactions between responses, e.g., commenting or retweeting, are essential information for the detection model to judge the source post (Castillo et al., 2011). The responses not only contain the users' opinions but also reveal the propagation paths through social media. Since the Graph Convolutional Network (GCN) is one of the most effective models for graph-structured data, we leverage a GCN to consider the propagation path.
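The flattening step can be sketched as follows (a toy illustration of ours: naive whitespace tokenization stands in for BERT's wordpiece tokenizer used in the paper):

```python
# Flatten a tree-structured thread into one token sequence, wrapping
# each post in [CLS] ... [SEP] as in Liu and Lapata (2019). The [CLS]
# position of each post later yields that node's representation.
def flatten_thread(posts):
    tokens = []
    cls_positions = []           # index of each node's [CLS] token
    for post in posts:           # posts are in chronological order
        cls_positions.append(len(tokens))
        tokens += ["[CLS]"] + post.split() + ["[SEP]"]
    return tokens, cls_positions

tokens, cls_pos = flatten_thread(["breaking news claim", "fake!", "source?"])
# tokens == ['[CLS]', 'breaking', 'news', 'claim', '[SEP]',
#            '[CLS]', 'fake!', '[SEP]', '[CLS]', 'source?', '[SEP]']
```

After the transformer layers, the features at the positions in `cls_pos` form the node feature matrix Z fed to the graph network.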
The message propagation function of a multi-layer GCN, defined with the first-order approximation of Chebyshev polynomials, is derived as follows:

Z^(l+1) = σ(Â Z^(l) W^(l)),

where Â is the normalized adjacency matrix with self-loops, W^(l) is a learnable matrix, and Z^(0) = Z is the node features. Although the GCN has been proved effective for extracting structural information, the aggregated information may not be faithful when specific nodes, e.g., framing responses, are involved. Therefore, considering the potential existence of various redundant or adversarial messages, we propose to filter the edges by learning the importance of edges before the aggregation. Specifically, based on the node features extracted by the transformer encoder, the importance of an edge e_{u,v} between nodes x_u and x_v is computed as follows:

e_{u,v} = σ((z_u || z_v) W_edge + b_edge),

where W_edge and b_edge are trainable parameters. The predicted importances are then used to construct a weighted adjacency matrix as follows:

Ã_{u,v} = e_{u,v} if (x_u, x_v) ∈ E, 1 if u = v, and 0 otherwise.

For the final prediction, the model considers the entire graph by taking mean pooling over all convolved node features instead of only the root node's attribute. The prediction is calculated from the feature matrix of the last GCN layer L:

ŷ = softmax(mean(Z^(L)) W_o + b_o),

where W_o and b_o are trainable parameters.
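A minimal numpy sketch of one weighted-edge GCN step as described above (a simplified illustration of ours, not the released implementation; the ReLU activation and the exact normalization details are our assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_gcn_layer(Z, edges, W_edge, b_edge, W):
    """Z: (N, d) node features from the transformer's [CLS] tokens.
    edges: list of (u, v) index pairs. Returns next-layer features."""
    n = Z.shape[0]
    A = np.eye(n)                            # self-loops
    for u, v in edges:
        # learned edge importance from the incident node features
        e = sigmoid(np.concatenate([Z[u], Z[v]]) @ W_edge + b_edge)
        A[u, v] = A[v, u] = e                # weighted adjacency
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt      # symmetric normalization
    return np.maximum(A_hat @ Z @ W, 0.0)    # ReLU (our assumption)

rng = np.random.default_rng(0)
d = 4
Z = rng.normal(size=(3, d))                  # 3-node toy thread
Z1 = weighted_gcn_layer(Z, [(0, 1), (0, 2)],
                        rng.normal(size=2 * d), 0.0,
                        rng.normal(size=(d, d)))
pooled = Z1.mean(axis=0)                     # mean pooling over all nodes
```

An edge score near zero effectively removes that edge from the aggregation, which is how the filter suppresses redundant or adversarial responses.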

Adversarial Response Generation
To further reduce the vulnerability of the detection model, we explore adversarial learning under the white-box attack setting, i.e., the parameters and gradients of the detector are exposed when updating the attacker. Specifically, we design a response generation model that attaches new adversarial responses to the conversation threads as an attacker against the detection model. For text generation, we adopt an encoder-decoder framework with transformer layers, which shows outstanding performance in text generation (Liu and Lapata, 2019). However, the gradients cannot be backpropagated from the detector to the generator for updating, due to the non-differentiable argmax function in generation (de Masson d'Autume et al., 2019). To solve this problem, we tie the generator's output layer E_out with the embedding layer E_in, which means the weights of the two layers are mutually transposed. In this way, the features before the argmax function can be treated as the embedding of the generated response. Given an input sequence {x_i}_{i=0}^{n-1}, a response is generated as follows:

h̃_n = f_dec(f_enc(h_0, ..., h_{n-1})),
x̂_n = argmax(softmax(E_out(h̃_n))),

where f_enc and f_dec are the encoder and the decoder. The features h̃_n can thus be directly used in the detector without breaking the gradient path. Besides, to reduce the model complexity, the encoder is shared between the generator and the detector. Nevertheless, for rumor detectors that incorporate propagation paths, the position at which to attach the generated response is also a crucial problem. Similar to text generation, the operation of choosing the attachment position is also discrete.
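The weight-tying trick can be sketched as follows (our own simplified numpy illustration). The key point is that, with E_out = E_in^T, the pre-argmax decoder feature lives in the same space as the input embeddings, so the detector can consume it directly:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 10, 6
E_in = rng.normal(size=(vocab, d))   # input embedding layer
E_out = E_in.T                       # tied output layer: transposed weights

h_n = rng.normal(size=d)             # decoder feature for the next token
logits = h_n @ E_out                 # (vocab,) scores over the vocabulary
token = int(np.argmax(logits))       # discrete token -- not differentiable

# Because of the tying, the logit for the chosen token is exactly the
# inner product of h_n with that token's input embedding, so h_n itself
# acts as a soft embedding of the generated token and gradients flow.
assert np.isclose(E_in[token] @ h_n, logits[token])
```

This is why the detector can be fed h̃_n in place of a discrete token embedding without breaking the gradient path.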
To enable the model to simultaneously learn the position while generating responses, the generation model additionally predicts the edge weights {e_{n,i}}_{i=0}^{n-1} between the generated response x̂_n and all existing nodes during training with the Gumbel-softmax function (Jang et al., 2016), i.e.,

e_{n,i} = exp((log π_i + g_i)/τ) / Σ_{j=0}^{n-1} exp((log π_j + g_j)/τ),  π_i = σ((h̃_n || h_i) W_p + b_p),

where g_i is i.i.d. sampled from the Gumbel(0, 1) distribution, τ is a hyper-parameter controlling the smoothness of the output distribution, and W_p ∈ R^{2d×1} and b_p are trainable parameters. A higher edge weight indicates a higher possibility that the attack can succeed at a specific position. It is worth noting that we focus on generating only one response to attack the model in order to validate the performance of the proposed attack. The proposed model can be extended to iteratively generate adversarial responses at different positions in the conversation thread.
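A minimal sketch of the Gumbel-softmax position weighting (our own stdlib illustration; the inputs are the log-scores log π_i over candidate attachment positions, which here are arbitrary example values):

```python
import math, random

def gumbel_softmax(log_scores, tau=0.5, seed=0):
    """Differentiable soft selection of an attachment position."""
    random.seed(seed)
    # g_i ~ Gumbel(0, 1), i.i.d., via inverse transform sampling
    g = [-math.log(-math.log(random.random())) for _ in log_scores]
    z = [(s + gi) / tau for s, gi in zip(log_scores, g)]
    m = max(z)
    exp_z = [math.exp(v - m) for v in z]   # subtract max for stability
    total = sum(exp_z)
    return [v / total for v in exp_z]

weights = gumbel_softmax([0.1, 2.0, -1.0], tau=0.5)
# weights are nonnegative and sum to 1; a low tau pushes them toward
# a (soft) one-hot vector over the candidate positions
```

As τ → 0 the output approaches a hard one-hot choice, while larger τ keeps the distribution smooth enough for gradients to reach all candidate positions.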

Training Pipeline
Adversarial examples can not only demonstrate the weaknesses of the detection model but also provide an opportunity to improve its robustness. However, there is a trade-off between the attack success rate and the noticeability. To strike a balance between them, the attacker should generate responses that are close to the real ones. Based on this idea, we i) decompose the conversation threads into several subtrees for the attacker to predict the next real response and ii) design a three-stage learning pipeline to mutually learn the attacker and the detector.

Firstly, the generator is trained along with the detector to increase the detection accuracy. To generate quality responses, we provide the generator with target sentences by decomposing one conversation tree into several subtrees. That is, given a subsequence of the conversation thread, the goal of the generator is to synthesize the next real response x. We only train the decoder layer θ_dec of the generation model while fixing the parameters of the encoder. For the detection model, the trainable layers are the encoder layer θ_enc, the filter layer θ_filter, and the GCN layer θ_gcn. The objective of the generator combines the binary cross entropy for rumor classification and the cross entropy for text perplexity,

L_txt = -(1/|x|) Σ_{m=1}^{|x|} log P_gen(w_m | w_{1:m-1}),

while the detector minimizes the rumor classification loss:

L_gen(θ_dec) = CE(ŷ, y) + L_txt,
L_det(θ_enc, θ_filter, θ_gcn) = CE(ŷ, y).

The loss of the first stage (L_1st) is derived by summing L_gen and L_det with a weight λ:

L_1st = λ L_gen(θ_dec) + L_det(θ_enc, θ_filter, θ_gcn).

The second training stage trains the generator while fixing the detector. In this stage, the goal of the generator is to generate a response that can confuse the detector, acting as an attacker. The detector takes the adversarial data as input and makes a prediction, and the target label is the reversed label ȳ, i.e., a rumor becomes a non-rumor and vice versa.
To make the generated text unnoticeable, i.e., similar to human-written sentences, the attacker is also trained to optimize L_txt. Therefore, the loss of the second stage is

L_2nd = L_gen(θ_dec) = CE(ŷ, ȳ) + L_txt.

The third training stage fine-tunes the detector under the fixed attacker. The detector is trained on the adversarial data and optimized to make the correct prediction. This training equips the detector with the ability to resist the attack and also to learn to filter out potentially redundant or adversarial messages. The objective function is as follows:

L_3rd = L_det(θ_enc, θ_filter, θ_gcn) = CE(ŷ, y).
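The stage-wise parameter freezing described above can be restated compactly (our own summary of the text, not the released code; names like `theta_dec` mirror the notation above):

```python
# Which parameter groups are trainable in each stage of the pipeline.
STAGES = {
    "1st": {"loss": "lambda * L_gen + L_det",
            "trainable": ["theta_dec", "theta_enc",
                          "theta_filter", "theta_gcn"]},
    "2nd": {"loss": "CE(y_hat, reversed_y) + L_txt",   # attack stage
            "trainable": ["theta_dec"]},               # detector frozen
    "3rd": {"loss": "CE(y_hat, y)",                    # fine-tune stage
            "trainable": ["theta_enc", "theta_filter",
                          "theta_gcn"]},               # attacker frozen
}

def trainable_params(stage):
    """Return the parameter groups updated in the given stage."""
    return STAGES[stage]["trainable"]
```

The shared encoder is only updated when the detector is trainable (stages 1 and 3); in stage 2 the attacker can touch nothing but its own decoder.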

Experiment Settings
Datasets. We evaluate the proposed AARD on three public datasets, Pheme (Zubiaga et al., 2016) and Twitter15 and Twitter16 (Ma et al., 2017), since these datasets contain source posts, the corresponding responses, and rumor labels. The original labels of the Twitter15 and Twitter16 datasets include four classes, i.e., true rumor, false rumor, unverified rumor, and non-rumor. In this paper, we focus on differentiating rumors from non-rumors and thus regard the first three classes as rumors. The Pheme dataset is collected based on five events with two classes, i.e., rumor and non-rumor. Due to the privacy protection policy of Twitter, the contents of responses are not included in the dataset. Therefore, we crawl the contents of responses ourselves. If the contents of a tweet have already been removed, we delete the empty tweet from the tree. Meanwhile, following previous work (Khoo et al., 2020), we also eliminate retweets with an empty text description. The statistics are shown in Table 1.
Baselines. The selection of the baselines follows two criteria: 1) "rumor detection" or "rumor veracity classification" and 2) availability of source code. Specifically, this paper designs a rumor detector and generator for the "rumor detection" task, which is a binary classification task. In contrast, "rumor veracity classification" is a four-class classification task (non-rumor/true-rumor/false-rumor/unverified-rumor). As Ma et al. (2019) also target the binary classification between rumor and non-rumor, it is selected as a baseline. For other works focusing on rumor veracity classification (Ma et al., 2018; Kumar and Carley, 2019; Yang et al., 2020; Khoo et al., 2020; Bian et al., 2020), one possible way to compare with these works is to reimplement the models and change their settings to binary classification. Therefore, Bian et al. (2020) and Ma et al. (2018) are used as baselines by reimplementing them and converting the labels to binary classification. Unfortunately, it is hard to compare with some baselines that do not release source code (Yang et al., 2020) or require additional information, e.g., user information and the stance of each response (Kumar and Carley, 2019). Finally, the baseline methods are: (1) RvNN (Ma et al., 2018), based on tree-structured recursive neural networks with GRU units that obtain representations from the propagation structure in the bottom-up (BURvNN) or top-down (TDRvNN) manner; (2) GAN-GRU (Ma et al., 2019), a GAN-style learning model where the discriminator and generator are recurrent neural networks with GRU units; (3) BiGCN (Bian et al., 2020), a GCN-based model that can embed both propagation and dispersion structures and enhance the root node features; and (4) GCAN (Lu and Li, 2020), which learns retweet propagation features from user features with a structure employing convolutional and recurrent neural networks.

Implementation Details. We use the same hyperparameters for all datasets.
Table 2: Rumor/non-rumor detection results. '-EF' denotes the model without the edge filter, '-DD' denotes training without the data decomposition, and '-PARG' indicates the detector without adversarial learning.

Specifically, the batch sizes of the detector and the generator are 48. The learning rate of the generator is 0.002, with a warm-up of 2000 steps. The learning rate of the decoder is set to 0.002. The token embeddings are initialized from BERT; therefore, the settings follow the pretrained model bert-base-uncased (Devlin et al., 2019). The Transformer encoder has 12 self-attention layers, and the number of GCN layers (L) is 2. The loss weight λ is 0.5.

Evaluation metrics. The evaluation metrics include accuracy, precision, recall, and F1 score for the two classes. We split each dataset into five folds (80% for training and 20% for testing) and report the average results.
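For convenience, the hyperparameters reported in this section can be collected into one configuration sketch (values copied from the text above; the field names are our own, not the released code's):

```python
# Hyperparameters as reported in the Implementation Details paragraph.
CONFIG = {
    "batch_size": 48,              # detector and generator
    "lr_generator": 0.002,
    "warmup_steps": 2000,
    "lr_decoder": 0.002,
    "pretrained_model": "bert-base-uncased",
    "encoder_layers": 12,          # transformer self-attention layers
    "gcn_layers": 2,               # L
    "loss_weight_lambda": 0.5,
    "cv_folds": 5,                 # 80% train / 20% test per fold
}
```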

Rumor Detection
Overall Performance. Table 2 shows the performance of all models on the rumor detection task. The results show that the proposed AARD outperforms all state-of-the-art models by at least 4.9%, 2.89%, and 3.87% on the Twitter15, Twitter16, and Pheme datasets, respectively. Compared with the methods that use recursive (BURvNN and TDRvNN) or recurrent (GAN-GRU) neural networks, graph-network-based models achieve better results, indicating that the propagation structure contains important information for detecting rumors. Different from BiGCN, which uses tf-idf vectors as the node features, AARD uses self-attention layers to encode the posts as node features. The transformer encoder enables the model to embed tokens across nodes, thus strengthening the node representations.
Moreover, the bottom rows of Table 2 show the ablation studies of the proposed AARD. The results show that both the edge filter and the data decomposition play important roles. On the other hand, the model achieves promising performance on the rumor detection task even without adversarial learning. The goal of adversarial learning is to address the second challenge, i.e., vulnerability to malicious attacks. Accordingly, the Position-aware Adversarial Response Generator (PARG) is designed to improve the robustness under a malicious attack. As the original testing dataset is clean (without attacks) or contains only a few manual attacks, the detection accuracy may not be significantly improved. However, when a detector is trained without adversarial learning (denoted by AARD-PARG in Table 2 and WETGN in Table 4), the performance drastically decreases when encountering an attack (adding one adversarial node to the conversation tree), which can be alleviated by fine-tuning on the adversarial data.

To further analyze the impact of data quantity on model performance, we train the models under different quantities of data, ranging from 5% to 100%, and evaluate them on the same testing set. Figure 4 shows the results, which indicate that our model still achieves leading performance even with minimal training data.

True/False Rumor Detection. We separately compare AARD with another graph-based model, GCAN, since GCAN focuses on the true-rumor/false-rumor classification task and evaluates on Twitter15 and Twitter16. Table 3 shows the results of true-rumor/false-rumor classification, indicating that AARD also performs excellently on this task. We believe this is because differentiating false rumors from true rumors also requires the model to carefully examine the responses.

Early Detection. Early detection aims to detect rumors at an early stage, which is an important indicator for evaluating detection models.
We follow Bian et al. (2020) and Ma et al. (2019) to construct the detection deadlines for the Twitter15 and Pheme datasets and only use the responses released before the deadlines to evaluate the accuracy. Figure 3 compares the accuracy under different detection deadlines. At the early stage, i.e., when a post has just come out with extremely few responses, the accuracy of the different models is around 0.75 on the Twitter15 dataset. After just a few minutes, the accuracy of our model reaches 0.85, whereas the accuracy of the baselines only approximates 0.8. For the Pheme dataset, we squeeze the time sequence and find that the performance of all models becomes stable, but our model stably outperforms the others.

Table 4 shows the model performance under an adversarial attack generated by PARG. The notation "→" indicates the performance before and after the attack, while "Diff." and "ASR" represent the accuracy difference and the Attack Success Rate (ASR) of PARG, respectively. The results indicate that the proposed PARG significantly reduces the accuracy of the detectors. The ASR is lower on the Pheme dataset than on the Twitter15 and Twitter16 datasets because Pheme is a much larger dataset than the others; the detector can therefore learn more indicative features from Pheme and be more robust. Moreover, comparing the performance of the detector with (WETGN) and without (-EF) the edge filter shows that adding the edge filter helps the detector resist the attack; that is, the "Diff." is lower for WETGN on Twitter15 and Twitter16. In addition, we use the adversarial samples to fine-tune the detector (AARD). The bottom of Table 4 shows the results of the fine-tuned model, where Adversarial/Clean indicates the accuracy tested on the dataset with/without adversarial attacks. The performance of AARD without the edge filter is also provided, which again suggests that the edge filter improves robustness.

Adversarial Attack
When a detector is trained without adversarial learning (WETGN), its performance decreases by at least 20% on the Twitter15 and Twitter16 datasets when encountering an attack. In contrast, the proposed detector with adversarial learning can maintain its performance even when the attacker has access to the model's parameters (white-box attack). It is worth noting that there may be a trade-off between the adversarial accuracy (tested on adversarial data) and the clean accuracy (tested on clean data) (Raghunathan et al., 2019), depending on how we fine-tune the detection models, e.g., only using the adversarial data or using both kinds of data. AARD is fine-tuned only on the adversarial data. By adjusting the experimental settings, the clean accuracy can be further improved in exchange for adversarial accuracy. Comparing AARD to WETGN suggests that the fine-tuned detection model can resist the attack (adversarial accuracy increases from 71.13 to 92.44) while barely affecting the clean accuracy (from 93.47 to 93.06) on the Twitter15 dataset.
Table 5 shows two examples of generated adversarial responses that attack successfully. In the first example, the source post is a rumor, and PARG alters the prediction by inserting the response "not a false alarm", which conveys a signal that it is actually not a rumor. In the second example, which is a non-rumor, PARG attacks it with a certain attitude, "can not believe", to deny that it is a "true" story. Similar responses can also be found among real responses written by humans. If a rumor detector only captures simple patterns, it may easily misclassify the above examples and fall victim to adversarial attacks.

Conclusion
In this paper, we propose a novel rumor detection framework, AARD, to reduce the vulnerability of detection models, which includes the Weighted-Edge Transformer-Graph Network (WETGN) and the Position-aware Adversarial Response Generator (PARG). The overall evaluation and ablation study results show the effectiveness of the proposed rumor detector on three public datasets. In addition, the adversarial attack results show the benefit of fine-tuning with the adversarial responses generated by PARG. In the future, we plan to further study the model's generalization to rumor veracity classification tasks and to incorporate response stances.