Towards Propagation Uncertainty: Edge-enhanced Bayesian Graph Convolutional Networks for Rumor Detection

Detecting rumors on social media is a critical task with significant implications for the economy, public health, and beyond. Previous works generally capture effective features from texts and the propagation structure. However, uncertainty caused by unreliable relations in the propagation structure is common and inevitable, owing to wily rumor producers and the limited collection of spread data. Most approaches neglect this uncertainty, which can severely limit feature learning. To address this issue, this paper makes the first attempt to explore propagation uncertainty for rumor detection. Specifically, we propose a novel Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) to capture robust structural features. The model adaptively rethinks the reliability of latent relations by adopting a Bayesian approach. In addition, we design a new edge-wise consistency training framework that optimizes the model by enforcing consistency on relations. Experiments on three public benchmark datasets demonstrate that the proposed model outperforms baseline methods on both rumor detection and early rumor detection tasks.


Introduction
With the ever-increasing popularity of social media sites, user-generated messages can quickly reach a wide audience. However, social media can also enable the spread of false rumor information (Vosoughi et al., 2018). Rumors are now viewed as one of the greatest threats to democracy, journalism, and freedom of expression, so detecting rumors on social media is highly desirable and socially beneficial (Ahsan et al., 2019). Almost all previous studies on rumor detection leverage text content, including the source tweet and all user retweets or replies. As time goes on, rumors form specific propagation structures after being retweeted or replied to. Vosoughi (2015) and Vosoughi et al. (2018) have confirmed that rumors spread significantly farther, faster, deeper, and more broadly than the truth, which opens the possibility of detecting rumors through the propagation structure. Some works (Ma et al., 2016; Kochkina et al., 2018) learn temporal features alone from propagation sequences, ignoring the internal topology. Recent approaches (Ma et al., 2018; Khoo et al., 2020) model the propagation structure as trees to capture structural features, and Bian et al. (2020) construct graphs and aggregate neighbors' features through edges based on reply or retweet relations.
However, most of them only work well in a narrow scope, since they treat these relations as reliable edges for message-passing. As shown in Figure 1, the existence of inaccurate relations brings uncertainty into the propagation structure. Neglecting unreliable relations can lead to severe error accumulation through multi-layer message-passing and limit the learning of effective features.
We argue that such inherent uncertainty in the propagation structure is inevitable for two reasons: i) In the real world, rumor producers are always wily. They tend to viciously manipulate others to create fake supporting tweets, or to remove opposing voices to evade detection. In these common scenarios, relations can be manipulated, which introduces uncertainty into the propagation structure. ii) Some annotations of spread relations are subjective and fragmentary (Ma et al., 2017; Zubiaga et al., 2016). The available graph may be only a portion of the real propagation structure and may contain noisy relations, again resulting in uncertainty. Therefore, it is very challenging to handle inherent uncertainty in the propagation structure to obtain robust detection results.
To alleviate this issue, we make the first attempt to explore the uncertainty in the propagation structure. Specifically, we propose a novel Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) for rumor detection, which models the uncertainty in the propagation structure from a probability perspective. The core idea of EBGCN is to adaptively control the message-passing based on the prior belief of the observed graph, replacing the fixed edge weights in the propagation graph. In each iteration, edge weights are inferred from the posterior distribution of latent relations according to the prior belief of node features in the observed graph. Then, we utilize graph convolutional layers to aggregate various adjacent information over the refined edges. Through the above network, EBGCN can handle the uncertainty in the propagation structure and promote the robustness of rumor detection.
Moreover, because missing or inaccurate relations are unavailable as supervision for training the proposed model, we design a new edge-wise consistency training framework. The framework combines unsupervised consistency training on these unlabeled relations with the original supervised training on labeled samples to promote better learning. We further ensure the consistency between the latent distribution of edges and the distribution of node features in the observed graph by computing the KL divergence between the two distributions. Ultimately, both the cross-entropy loss of each claim and the Bayes by Backprop loss of latent relations are optimized to train the proposed model.
We conduct experiments on three real-world benchmark datasets (i.e., Twitter15, Twitter16, and PHEME). Extensive experimental results demonstrate the effectiveness of our model: EBGCN offers a superior uncertainty representation strategy and boosts the performance of rumor detection. The main contributions of this work are summarized as follows:
• We propose a novel Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) to handle the uncertainty in a probabilistic manner.
To the best of our knowledge, this is the first attempt to consider the inherent uncertainty in the propagation structure for rumor detection.
• We design a new edge-wise consistency training framework to optimize the model with unlabeled latent relations.
• Experiments on three real-world benchmark datasets demonstrate the effectiveness of our model on both rumor detection and early rumor detection tasks.
Related Work

Rumor Detection
Traditional methods for rumor detection adopted machine learning classifiers based on hand-crafted features, such as sentiments (Castillo et al., 2011), bags of words (Enayet and El-Beltagy, 2017), and time patterns (Ma et al., 2015). Based on salient features of rumor spreading, Wu et al. (2015) and Ma et al. (2017) modeled propagation trees and then used SVMs with different kernels to detect rumors. Recent works have been devoted to deep learning methods. Ma et al. (2016) employed Recurrent Neural Networks (RNNs) to sequentially process each timestep in the rumor propagation sequence. To improve on this, many researchers captured longer-range dependencies via attention mechanisms (Chen et al., 2018), convolutional neural networks (Yu et al., 2017; Chen et al., 2019), and Transformers (Khoo et al., 2020). However, most of these methods focus on learning temporal features alone, ignoring the internal topology.
To capture topological-structural features, Ma et al. (2018) modeled propagation trees with recursive neural networks, and later works such as Bian et al. (2020) aggregated neighbor features over propagation graphs. However, most of them treat edges as reliable topology connections for message-passing. Ignoring the uncertainty caused by unreliable relations can lead to a lack of robustness and make rumor detection risky. Inspired by valuable research (Zhang et al., 2019a) that modeled uncertainty caused by finite available textual contents, this paper makes the first attempt to consider the uncertainty caused by unreliable relations in the propagation structure for rumor detection.

Graph Neural Networks
Graph Neural Networks (GNNs) (Kipf and Welling, 2017; Schlichtkrull et al., 2018; Velickovic et al., 2018) have demonstrated remarkable performance in modeling structured data in a wide variety of fields, e.g., text classification (Yao et al., 2019), recommendation systems (Wu et al., 2019), and emotion recognition (Ghosal et al., 2019). Although promising, they have limited capability to handle uncertainty in the graph structure, even though the graphs employed in real-world applications are themselves often derived from noisy data or modeling assumptions. To alleviate this issue, some valuable works (Luo et al., 2020; Zhang et al., 2019b) provide approaches for incorporating uncertain graph information by exploiting a Bayesian framework (Maddox et al., 2019). Inspired by them, this paper explores the uncertainty in the propagation structure from a probability perspective to obtain more robust rumor detection results.

Problem Statement
This paper develops EBGCN, which processes the text contents and the propagation structure of each claim for rumor detection. In general, rumor detection can be regarded as a multi-class classification task, which aims to learn a classifier from training claims to predict the label of a test claim.
Formally, let C = {c_1, c_2, ..., c_m} be the rumor detection dataset, where c_i is the i-th claim and m is the number of claims. Each claim is c_i = {r_i, x^i_1, ..., x^i_{n_i - 1}, G_i}, where G_i indicates the propagation structure, r_i is the source tweet, x^i_j refers to the j-th relevant retweet, and n_i represents the number of tweets in the claim c_i. Specifically, G_i is defined as a propagation graph G_i = ⟨V_i, E_i⟩ with root node r_i (Ma et al., 2018; Bian et al., 2020), where V_i = {r_i, x^i_1, ..., x^i_{n_i - 1}} refers to the node set and E_i = {e^i_st | s, t = 0, ..., n_i - 1} represents the set of directed edges from a tweet to its corresponding retweets. Denote A_i ∈ R^{n_i × n_i} as an adjacency matrix whose initial entry is 1 if the corresponding edge exists and 0 otherwise. Besides, each claim c_i is annotated with a ground-truth label y_i ∈ Y, where Y represents the fine-grained classes. Our goal is to learn a classifier f : C → Y from the labeled claim set.
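The formulation above can be sketched in code. This is a minimal illustration, not the authors' implementation; the field names (source_tweet, retweets, edges, label) are our own.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Claim:
    """One claim c_i: a source tweet r_i, its retweets, and the propagation edges."""
    source_tweet: str   # r_i
    retweets: list      # x^i_1 .. x^i_{n_i - 1}
    edges: list         # directed (s, t) pairs: tweet s -> retweet t
    label: str          # ground-truth label y_i from the fine-grained classes Y

    @property
    def n(self):
        # total number of tweets in the claim (source + retweets)
        return 1 + len(self.retweets)


def adjacency(claim):
    """Initial adjacency matrix: entry (s, t) is 1 iff edge e_st is observed."""
    A = np.zeros((claim.n, claim.n))
    for s, t in claim.edges:
        A[s, t] = 1.0
    return A


claim = Claim("source", ["reply-1", "reply-2"], [(0, 1), (0, 2)], "false")
A = adjacency(claim)   # 3x3 matrix with ones at (0,1) and (0,2)
```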

The Proposed Model
In this section, we propose a novel Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) for rumor detection. For better training, we design an edge-wise consistency training framework to optimize EBGCN.

Overview
The overall architecture of EBGCN is shown in Figure 2. Given an input sample comprising text contents and the corresponding propagation structure, we first formulate the propagation structure as directed graphs with two opposite directions, i.e., a top-down propagation graph and a bottom-up dispersion graph. Text contents are embedded by the text embedding layer. After that, we iteratively capture rich structural characteristics via two main components: a node update module and an edge inference module. Then, we aggregate node embeddings to generate the graph embedding and output the label of the claim.
For training, we incorporate unsupervised consistency training via the Bayes by Backprop loss of unlabeled latent relations. Accordingly, we optimize the model by minimizing the weighted sum of the unsupervised loss and the supervised loss.
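The overall forward pass described above can be sketched at a high level. This is a toy illustration of the data flow only: the edge inference step is reduced to row-normalization and the node update has no learned weights, both stand-ins for the learned modules described later.

```python
import numpy as np


def forward(X, A_td, A_bu, layers=2):
    """Toy data-flow sketch: iterate edge reweighting + neighbor aggregation
    on the top-down and bottom-up graphs, then mean-pool and concatenate."""
    H_td, H_bu = X.copy(), X.copy()
    for _ in range(layers):
        # edge inference (toy stand-in): reweight observed edges per row
        A_td_soft = A_td / np.maximum(A_td.sum(1, keepdims=True), 1.0)
        A_bu_soft = A_bu / np.maximum(A_bu.sum(1, keepdims=True), 1.0)
        # node update (toy stand-in): aggregate neighbor features + ReLU
        H_td = np.maximum(A_td_soft @ H_td, 0.0)
        H_bu = np.maximum(A_bu_soft @ H_bu, 0.0)
    # graph embedding: concatenation of the two mean-pooled graph vectors
    return np.concatenate([H_td.mean(0), H_bu.mean(0)])


X = np.abs(np.random.default_rng(3).normal(size=(4, 6)))   # 4 tweets, 6-dim features
A = np.array([[0, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]], float)
g = forward(X, A, A.T)   # 12-dim graph embedding
```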

Edge-enhanced Bayesian Graph Convolutional Networks

Graph Construction and Text Embedding
The initial graph construction is similar to previous work (Bian et al., 2020), i.e., we build two distinct directed graphs for the propagation structure of each claim c_i. The top-down propagation graph and the bottom-up dispersion graph are denoted as G^TD_i and G^BU_i, respectively. Their corresponding initial adjacency matrices are A^TD = A and A^BU = A^T, i.e., the dispersion graph reverses the direction of every edge. Here, we leave out the superscript i in the following description for clarity.
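The bidirectional construction above amounts to one transpose. A minimal sketch (not the authors' code), assuming the convention that A^TD points from a tweet to its retweets:

```python
import numpy as np


def build_bidirectional(A):
    """Top-down propagation graph keeps A; bottom-up dispersion graph is its
    transpose, so every reply edge is reversed to point back to its parent."""
    A_td = A.copy()      # tweet -> retweet
    A_bu = A.T.copy()    # retweet -> tweet
    return A_td, A_bu


# source tweet 0 with two direct replies 1 and 2
A = np.array([[0., 1., 1.],
              [0., 0., 0.],
              [0., 0., 0.]])
A_td, A_bu = build_bidirectional(A)
```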
The initial feature matrix of postings in claim c is extracted from the top-5,000 words in terms of TF-IDF values, denoted as X = [x_0, x_1, ..., x_{n-1}]^T ∈ R^{n × d_0}, where x_0 is the vector of the source tweet and d_0 is the dimensionality of the textual features. The initial feature matrices of nodes in the propagation graph and the dispersion graph are the same, i.e., X^TD = X^BU = X.
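The TF-IDF feature construction above can be sketched as follows. This is a toy stand-in, not the paper's pipeline: the paper keeps the top 5,000 words of the full corpus, while the toy corpus here is far smaller, so all words survive the cap; the ranking-by-document-frequency choice is our simplification.

```python
import math
from collections import Counter


def tfidf_matrix(docs, top_k=5000):
    """Each document becomes a vector of TF-IDF scores over a capped vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    # keep at most top_k vocabulary words (toy ranking: by document frequency)
    vocab = [w for w, _ in df.most_common(top_k)]
    index = {w: i for i, w in enumerate(vocab)}
    n = len(docs)
    X = [[0.0] * len(vocab) for _ in docs]
    for row, toks in zip(X, tokenized):
        tf = Counter(toks)
        for w, c in tf.items():
            row[index[w]] = (c / len(toks)) * math.log(n / df[w])  # tf * idf
    return X, vocab


X, vocab = tfidf_matrix(["fake news spreads fast", "news confirmed", "fake claim"])
```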

Node Update
Graph convolutional networks (GCNs) (Kipf and Welling, 2017) are able to extract graph structure information and better characterize a node's neighborhood. They define multiple Graph Convolutional Layers (GCLs) to iteratively aggregate the features of each node's neighbors, and can be formulated as a simple differentiable message-passing framework. Motivated by GCNs, we employ the GCL to update node features in each graph. Formally, node features at the l-th layer can be defined as H^(l) = σ(Â^(l-1) H^(l-1) W^(l)), where Â^(l-1) represents the normalization of the adjacency matrix A^(l-1) (Kipf and Welling, 2017), W^(l) is a layer-specific learnable weight matrix, and σ(·) is an activation function. We initialize node representations with the textual features, i.e., H^(0) = X.
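A minimal sketch of one such GCL, using the symmetric normalization with self-loops of Kipf and Welling (2017) and ReLU as σ(·). The random weight initialization is illustrative only, not the authors' setting.

```python
import numpy as np


def normalize_adj(A):
    """D^{-1/2} (A + I) D^{-1/2}: symmetrically normalized adjacency with self-loops."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt


def gcl(A, H, W):
    """One graph convolutional layer: H' = ReLU(Â H W)."""
    return np.maximum(normalize_adj(A) @ H @ W, 0.0)


rng = np.random.default_rng(0)
A = np.array([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
H0 = rng.normal(size=(3, 8))   # initial node features H^(0) = X
W1 = rng.normal(size=(8, 4))   # layer weight matrix (illustrative init)
H1 = gcl(A, H0, W1)            # node features after one layer
```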

Edge Inference
To alleviate the negative effects of unreliable relations, we rethink edge weights based on the currently observed graph by adopting soft connections.
Specifically, we adjust the weight between two nodes by computing a transformation f_e(·; θ_t) of the node representations at the previous layer; the adjacency matrix is then updated accordingly. In practice, f_e(·; θ_t) consists of a convolutional layer and an activation function, T refers to the number of latent relation types, σ(·) refers to a sigmoid function, and W^(l)_t and b^(l)_t are learnable parameters. We share the parameters of the edge inference layer across the two graphs G^TD and G^BU. After stacking these transformations over two layers, the model can effectively accumulate a normalized sum of neighbor features driven by latent relations, denoted as H^TD and H^BU.
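The edge-inference idea can be sketched as follows: for every observed edge, a transformation of the two endpoint representations produces a weight in (0, 1) that replaces the fixed 0/1 entry. Using a single dot-product scorer here is our simplification; the paper uses a learned convolutional transformation with T latent relation types.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def infer_edge_weights(A, H, W_t):
    """Reweight observed edges from node representations (toy scorer).
    Non-edges stay zero; observed edges get a soft weight in (0, 1)."""
    Z = H @ W_t                       # transformed node representations
    scores = sigmoid(Z @ Z.T)         # pairwise relation scores in (0, 1)
    return A * scores                 # mask to the observed propagation edges


rng = np.random.default_rng(1)
A = np.array([[0., 1., 1.], [0., 0., 0.], [0., 0., 0.]])
H = rng.normal(size=(3, 4))           # node representations from the previous layer
W_t = rng.normal(size=(4, 4))         # learnable transformation (illustrative init)
A_soft = infer_edge_weights(A, H, W_t)
```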

Classification
We regard the rumor detection task as a graph classification problem. To aggregate the node representations in each graph, we employ a mean-pooling aggregator to form the graph representations. Given the node representations H^TD in the propagation graph and H^BU in the dispersion graph, the graph representations are computed by mean-pooling(·), the mean-pooling aggregating function. Based on the concatenation of the two graph representations, the label probabilities of all classes are produced by a fully connected layer and a softmax function, where W_c and b_c are learnable parameters.
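The readout described above can be sketched as: mean-pool the node representations of each graph, concatenate the two graph vectors, and apply a linear layer with softmax. W_c and b_c are randomly initialized here for illustration only.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def classify(H_td, H_bu, W_c, b_c):
    """Mean-pool each graph, concatenate, then linear layer + softmax."""
    g = np.concatenate([H_td.mean(axis=0), H_bu.mean(axis=0)])  # graph embedding
    return softmax(g @ W_c + b_c)                               # class probabilities


rng = np.random.default_rng(2)
H_td = rng.normal(size=(5, 8))   # propagation-graph node representations
H_bu = rng.normal(size=(5, 8))   # dispersion-graph node representations
W_c = rng.normal(size=(16, 4))   # 4 fine-grained classes (illustrative init)
b_c = np.zeros(4)
probs = classify(H_td, H_bu, W_c, b_c)
```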

Edge-wise Consistency Training Framework
For the supervised learning loss L_c, we compute the cross-entropy between the predictions and the ground-truth distributions over the claims C = {c_1, c_2, ..., c_m}, where y_i is a vector representing the distribution of the ground-truth label for the i-th claim sample.
For the unsupervised learning loss L_e, we amortize the posterior distribution of the classification weight p(φ) as q(φ) to enable quick prediction at the test stage, and learn the parameters by minimizing the average expected loss over latent relations, i.e., φ* = arg min_φ L_e, where L_e = E[ KL( p(r^(l) | H^(l-1), G) || q_φ(r^(l) | H^(l-1), G) ) ], and r̂ denotes the prediction distribution of latent relations. To keep the likelihood tractable, we model the prior distribution of each latent relation r_t, t ∈ [1, T], independently. For each relation, we define a factorized Gaussian distribution q_φ(φ | H^(l-1), G; Θ) with means μ_t and variances δ_t^2 set by the transformation layer, where f_μ(·; θ_μ) and f_δ(·; θ_δ) compute the mean and variance of the input vectors, parameterized by θ_μ and θ_δ, respectively. This amounts to setting the weight of each latent relation. Besides, we also consider the likelihood of latent relations when parameterizing the posterior distribution of prototype vectors; the likelihood of latent relations from the l-th layer can be adaptively computed based on node embeddings. In this way, the weights of edges can be adaptively adjusted based on the observed graph, which can thus be used to effectively pass messages and learn more discriminative features for rumor detection.
To sum up, during training we optimize EBGCN by minimizing the weighted sum of the cross-entropy loss of labeled claims L_c and the Bayes by Backprop loss of unlabeled latent relations L_e, i.e., L = L_c + γ L_e, where γ is the trade-off coefficient.
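The combined objective can be sketched as below. The closed-form KL divergence between two diagonal Gaussians is used as a stand-in for the Bayes by Backprop term over latent relations; the paper's actual term is computed over the learned relation distributions.

```python
import numpy as np


def cross_entropy(probs, y):
    """Mean negative log-likelihood of the gold class per claim (L_c)."""
    return -np.log(probs[np.arange(len(y)), y]).mean()


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """Closed-form KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def total_loss(probs, y, mu_p, var_p, mu_q, var_q, gamma=0.3):
    """L = L_c + gamma * L_e, with a diagonal-Gaussian KL as the toy L_e."""
    return cross_entropy(probs, y) + gamma * kl_diag_gaussians(mu_p, var_p, mu_q, var_q)


probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.6, 0.1, 0.1]])
y = np.array([0, 1])
loss = total_loss(probs, y,
                  mu_p=np.zeros(3), var_p=np.ones(3),
                  mu_q=np.zeros(3), var_q=np.ones(3) * 2.0)
```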

Datasets
We evaluate the model on three real-world benchmark datasets: Twitter15 (Ma et al., 2017), Twitter16 (Ma et al., 2017), and PHEME (Zubiaga et al., 2016). The statistics are shown in Table 1. Twitter15 and Twitter16 contain 1,490 and 818 claims, respectively. Each claim is labeled as Non-rumor (NR), False Rumor (F), True Rumor (T), or Unverified Rumor (U). Following (Ma et al., 2018; Bian et al., 2020), we randomly split each dataset into five parts and conduct 5-fold cross-validation to obtain robust results. The PHEME dataset provides 2,402 claims covering nine events and contains three labels: False Rumor (F), True Rumor (T), and Unverified Rumor (U). Following previous work, we conduct leave-one-event-out cross-validation, i.e., in each fold, one event's samples are used for testing and all the rest are used for training.
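The leave-one-event-out protocol described above for PHEME can be sketched as follows; the event names and claim IDs are illustrative.

```python
def leave_one_event_out(claims):
    """claims: list of (event_name, claim_id) pairs.
    Yields one (held_out_event, train, test) fold per event."""
    events = sorted({e for e, _ in claims})
    for held_out in events:
        train = [c for e, c in claims if e != held_out]
        test = [c for e, c in claims if e == held_out]
        yield held_out, train, test


data = [("charliehebdo", "c1"), ("charliehebdo", "c2"),
        ("ferguson", "c3"), ("sydneysiege", "c4")]
folds = list(leave_one_event_out(data))   # one fold per event
```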

Baselines
For Twitter15 and Twitter16, we compare our proposed model with the following methods: DTC and SVM-TS, which rely on hand-crafted features; GRU-RNN and StA-PLAN, which are sequence-based; and RvNN and BiGCN, which are structure-based. For PHEME, we compare with several representative state-of-the-art baselines. NileTMRG (Enayet and El-Beltagy, 2017) used linear support vector classification based on bags of words. BranchLSTM (Kochkina et al., 2018) decomposed the propagation tree into multiple branches and adopted a shared LSTM to capture structural features. RvNN (Ma et al., 2018) consisted of two recursive neural networks to model propagation trees. Hierarchical GCN-RNN modeled structural properties based on GCNs and RNNs. BiGCN (Bian et al., 2020) consisted of propagation and dispersion GCNs to learn structural features from the propagation graph.

Parameter Settings
Following the comparison baselines, the dimension of hidden vectors in the GCL is set to 64. The number of latent relation types T and the coefficient weight γ are tuned in [1, 5] and [0.0, 1.0], respectively. We train the model via backpropagation with Adam (Kingma and Ba, 2015), a widely used stochastic gradient descent optimizer. The learning rate is set to 0.0002, 0.0005, and 0.02 for Twitter15, Twitter16, and PHEME, respectively. The training process runs for up to 200 epochs, and early stopping (Yuan et al., 2007) is applied when the validation loss stops decreasing for 10 epochs. The optimal set of hyperparameters is determined by testing the performance on the fold-0 set of Twitter15 and Twitter16 and on the class-balanced Charlie Hebdo event set of PHEME.
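The early-stopping rule above (stop when the validation loss has not decreased for 10 epochs, capped at 200 epochs) can be sketched as follows; the loss sequence is synthetic, for illustration only.

```python
def train_with_early_stopping(val_losses, max_epochs=200, patience=10):
    """Return the epoch at which training stops: either when the validation
    loss has not improved for `patience` epochs, or at the epoch cap."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # stopped early: no improvement for `patience` epochs
    return min(len(val_losses), max_epochs) - 1


# loss drops until epoch 5, then plateaus -> stops 10 epochs after the best
losses = [1.0 - 0.1 * i for i in range(6)] + [0.5] * 50
stopped_at = train_with_early_stopping(losses)
```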
Besides, on PHEME, following previous work, we replace the TF-IDF features with word embeddings produced by skip-gram with negative sampling (Mikolov et al., 2013) and set the dimension of textual features to 200. We implement these variants of BiGCN and EBGCN, denoted as BiGCN(SKP) and EBGCN(SKP), respectively.
For the results of baselines, we implement BiGCN according to their public project under the same environment. Other baseline results are referenced from the original papers (Khoo et al., 2020; Ma et al., 2018).

Results and Analysis

Performance Comparison with Baselines
Table 2 shows the results of rumor detection on the Twitter15, Twitter16, and PHEME datasets. Our proposed model EBGCN obtains the best performance among all baselines. Specifically, on Twitter15, EBGCN outperforms state-of-the-art models by 2.4% accuracy and 3.6% F1 score on false rumors. On Twitter16, our model obtains 3.4% and 6.0% improvements on accuracy and the F1 score of non-rumors, respectively. On PHEME, EBGCN significantly outperforms previous work by 40.2% accuracy, 34.7% mF1, and 18.0% wF1.
Deep learning-based methods (RvNN, StA-PLAN, BiGCN, and EBGCN) outperform conventional methods based on hand-crafted features (DTC, SVM-TS), which reveals the superiority of learning high-level representations for detecting rumors. Moreover, EBGCN outperforms the sequence-based models GRU-RNN and StA-PLAN. This can be attributed to the fact that they capture temporal features alone but ignore internal topology structures, which limits the learning of structural features; EBGCN instead aggregates neighbor features in the graph to learn rich structural features.
Furthermore, compared with the state-of-the-art graph-based BiGCN, EBGCN also obtains better performance. We attribute this to two main reasons. First, BiGCN treats relations among tweet nodes as reliable edges, which may introduce inaccurate or irrelevant features, so its performance lacks robustness. EBGCN considers the inherent uncertainty in the propagation structure: unreliable relations are refined in a probabilistic manner, which mitigates the bias they introduce and thus enhances the robustness of detection. Second, the edge-wise consistency training framework ensures the consistency between uncertain edges and the current node features, which also helps learn more effective structural features for rumor detection.
Besides, EBGCN(SKP) and BiGCN(SKP) outperform EBGCN and BiGCN with TF-IDF features in terms of Acc. and wF1, which shows the superiority of word embeddings for capturing textual features. Our model consistently obtains better performance under different text embeddings, which reveals the stability of EBGCN.

Model Analysis
In this part, we further evaluate the effects of key components in the proposed model.
The Effect of Edge Inference. The number of latent relation types T is a critical parameter in the edge inference module. Figure 3(a) shows the accuracy score against T. The best performance is obtained when T is 2, 3, and 4 on Twitter15, Twitter16, and PHEME, respectively; note that these best settings differ across datasets. A plausible explanation is that relations among tweets vary across periods and gradually become more sophisticated in the real world with the development of social media. The edge inference module can adaptively refine the reliability of these complex relations via the posterior distribution of latent relations, which mitigates the influence of uncertain relations and promotes the robustness of rumor detection.
The Effect of the Unsupervised Relation Learning Loss. The trade-off parameter γ controls the effect of the proposed edge-wise consistency training framework; γ = 0.0 means the framework is omitted. The right of Figure 3 shows the accuracy score against γ. When the framework is removed, the model yields the worst performance. The optimal γ is 0.4, 0.3, and 0.3 on Twitter15, Twitter16, and PHEME, respectively. These results prove the effectiveness of the framework. Due to wily rumor producers and limited annotations of spread information, it is common and inevitable that datasets contain unreliable relations. The framework ensures the consistency between edges and the corresponding node pairs to mitigate the effect of noisy features.

Early Rumor Detection
Rumor early detection aims to detect a rumor at its early stage, before it spreads widely on social media, so that appropriate actions can be taken earlier. It is especially critical for a real-time rumor detection system. To evaluate performance on early rumor detection, we follow Ma et al. (2018) and control the detection deadline or the tweet count since the source tweet was posted: the earlier the deadline or the smaller the tweet count, the less propagation information is available. Figure 4 shows the performance of early rumor detection. First, all models improve as the detection deadline elapses or the tweet count increases. In particular, at each deadline or tweet count, our model EBGCN reaches a relatively higher accuracy score than the comparable models.
Second, compared with RvNN, which captures temporal features alone, and SVM-TK, which is based on hand-crafted features, the superior performance of EBGCN and BiGCN, which explore rich structural features, reveals that structural features are more beneficial to the early detection of rumors.
Third, EBGCN obtains better early detection results than BiGCN. It demonstrates that EBGCN can learn more conducive structural features to identify rumors by modeling uncertainty and enhance the robustness for early rumor detection.
Overall, our model not only performs better on long-term rumor detection but also boosts the performance of detecting rumors at an early stage.
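The early-detection protocol above (truncating each claim's propagation to the tweets available before a deadline, or to the first k tweets) can be sketched as follows; the timestamps and edges are illustrative.

```python
def truncate_claim(timestamps, edges, deadline=None, max_tweets=None):
    """Keep only the tweets posted before `deadline` (and/or the earliest
    `max_tweets`), then keep the edges whose endpoints both survive.
    timestamps: posting time per tweet; index 0 is the source tweet."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    keep = set(order)
    if deadline is not None:
        keep = {i for i in keep if timestamps[i] <= deadline}
    if max_tweets is not None:
        keep = set(order[:max_tweets]) & keep
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


times = [0, 5, 12, 30, 60]                 # minutes after the source tweet
edges = [(0, 1), (0, 2), (2, 3), (3, 4)]
keep, kept_edges = truncate_claim(times, edges, deadline=15)
```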

The Case Study
In this part, we conduct a case study to show the existence of uncertainty in the propagation structure and to explain why EBGCN performs well. We randomly sample a false rumor from PHEME, as depicted in Figure 5. Tweets are formulated as nodes and relations as edges in the graph, where node 1 refers to the source tweet and nodes 2-8 refer to the following retweets.
As shown on the left of Figure 5, we observe that tweet 5 is irrelevant to tweet 1 even though it replies to it, which reveals the ubiquity of unreliable relations among tweets in the propagation structure and shows that it is reasonable to consider the uncertainty caused by these unreliable relations.
The right of Figure 5 shows the constructed graphs, where the color shade indicates the value of the edge weights: the darker the color, the greater the weight. Existing graph-based models always generate the representation of node 1 by aggregating the information of all its neighbors (nodes 2, 5, and 6) according to seemingly reliable edges. However, the edge between nodes 1 and 5 would bring noisy features and limit the learning of useful features for rumor detection. Our model EBGCN successfully weakens the negative effect of this edge through the edge inference layer under the edge-wise consistency training framework. Accordingly, the model is capable of learning more conducive characteristics and enhances the robustness of results.

Conclusion
In this paper, we have studied the uncertainty in the propagation structure from a probability perspective for rumor detection. Specifically, we propose an Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) that handles uncertainty with a Bayesian method by adaptively adjusting the weights of unreliable relations. Besides, we design an edge-wise consistency training framework incorporating unsupervised relation learning to enforce consistency on latent relations. Extensive experiments on three commonly used benchmark datasets have proved the effectiveness of modeling uncertainty in the propagation structure: EBGCN significantly outperforms baselines on both rumor detection and early rumor detection tasks.