Debunking Rumors on Twitter with Tree Transformer

Rumors are manufactured with no respect for accuracy, but can circulate quickly and widely by “word-of-post” through social media conversations. The conversation tree encodes important information indicative of the credibility of a rumor. Existing conversation-based techniques for rumor detection either strictly follow tree edges or treat all the posts as fully connected during feature learning. In this paper, we propose a novel detection model based on a tree transformer to better utilize user interactions in the dialogue, where post-level self-attention plays the key role in aggregating the intra-/inter-subtree stances. Experimental results on the TWITTER and PHEME datasets show that the proposed approach consistently improves rumor detection performance.


Introduction
Online rumor is perhaps one of the most prevalent social diseases in the era of social media. An immediate example we are witnessing is the unprecedented information disorder represented by various rumors, conspiracy theories, hoaxes, fake news, etc. in parallel with the worldwide COVID-19 pandemic. In different places, a number of people were hospitalized or even died from drinking bootleg alcohol to prevent coronavirus infection, as a result of a false rumor targeting a gullible public and claiming that "smoking, methanol or cocaine can cure the virus". Automatic rumor debunking is at the core of the battle against such massive information disorder, especially in the midst of a crisis.
Rumor debunking aims to determine the veracity of a given topic or claim. Fact-checking websites, such as snopes.com and politifact.com, employ manual verification and investigative journalism, which is prone to low efficiency and poor coverage. For automated approaches, prior studies focus on engineering or learning features from sequential microblog streams (Castillo et al., 2011; Yang et al., 2012; Kwon et al., 2013; Liu et al., 2015; Ma et al., 2015; Ma et al., 2016; Yu et al., 2017). More recently, structure-based learning methods have been proposed to capture the interactive characteristics of rumor diffusion, such as the tree kernel (Ma et al., 2017), the recursive neural network (Ma et al., 2018) and the tree LSTM model (Kumar and Carley, 2019). Khoo et al. (2020) proposed to model potential dependencies between any two microblog posts with post-level self-attention networks (PLAN), which achieved state-of-the-art detection performance.
The PLAN model essentially treats the input tweets as a fully connected graph, assuming that a user may not be addressing solely the tweet being replied to, since the content created could also be applicable to other tweets in the thread (Khoo et al., 2020). In addition, the representation of posts is enhanced by leveraging the strength of the transformer's encoding architecture. Nevertheless, we argue that such full connection, which ignores the specific targets of replies in the hierarchy, could create salient issues for post representation learning, especially in relatively deep conversations or arguments. Meanwhile, other existing tree-structured models based on propagation trees (Wu et al., 2015; Ma et al., 2017) or recursive trees (Ma et al., 2018; Kumar and Carley, 2019) tend to oversimplify user interaction by strictly following the tree edges for post matching or encoding.

Figure 1: A motivating example: A false rumor about "Mike Pence delivering empty boxes of PPE for PR stunt" widely spread on Twitter and the stances relative to parent nodes implying the underlying credibility of the claim.
To illustrate our intuition, Figure 1 exemplifies the propagation structure of a (rumor) claim "Mike Pence caught on hot mic delivering empty boxes of PPE for a PR stunt". The PLAN model basically assumes that each user addresses all the tweets in the thread, which may be true for a shallow tree where most of the nodes respond to the root node. However, this is not the case for a tree hierarchy such as the one in Figure 1. It can be seen that an accurate viewpoint is generally associated with the context of the parent posts, e.g., $x_{21}$ supports $x_2$, but $x_2$ refutes the source claim $r$; therefore $x_{21}$ believes that the claim is false even though it contains a non-rumor-indicative pattern "be right". On the other hand, $x_{21}$ has no contextual correlation with nodes from another branch such as $x_{12}$. The PLAN model might introduce unexpected errors in this case by linking $x_{21}$ with $r$ (or $x_{12}$) when making fully pairwise comparisons.
To this end, we propose to enhance the representation by exploring the stances towards the same target, utilizing the associated contextual information. The starting point of our approach is an observation: each post in the propagation tree may trigger a set of responsive tweets (such as $x_1 \rightarrow [x_{11}, x_{12}]$ in Figure 1). We define such a unit as a subtree, and the subtrees together compose the whole tree hierarchy. Accordingly, we extend the conventional transformer's encoder into three variants, i.e., a bottom-up transformer, a top-down transformer, and a hybrid transformer model. More specifically, our models selectively attend over tweets in the same subtree. As a result, a user's viewpoint can be expected to be captured more fully based on the context of the propagation path. Meanwhile, inaccurate information in a subtree can be cross-checked, as users share opinions towards the same target (i.e., the subtree root). We construct two shallow-tree datasets and two deep-tree datasets derived from two public benchmarks, TWITTER and PHEME. Extensive experimental results demonstrate that our approach consistently improves over the state-of-the-art rumor detection and early detection baselines, performing particularly well on the deep trees.

Related Work
This section first reviews recent progress on rumor detection. Most previous automatic approaches for rumor detection (Castillo et al., 2011; Yang et al., 2012; Liu et al., 2015) learn a supervised classifier by utilizing a wide range of features crafted from post contents, user profiles and propagation patterns. Subsequent studies engineered new features such as those representing rumor diffusion and cascades (Kwon et al., 2013; Friggeri et al., 2014; Hannak et al., 2014). Ma et al. (2015) extended their model with a large set of chronological social context features. These approaches typically require heavy preprocessing and feature engineering. Zhao et al. (2015) alleviated the engineering effort by using a set of regular expressions (such as "really?", "not true", etc.) to find questioning and denying tweets, but the approach was oversimplified and suffered from very low recall. Ma et al. (2016) and Yu et al. (2017) used recurrent neural networks (RNNs) and convolutional neural networks (CNNs), respectively, to automatically learn representations from tweet content based on time series. Guo et al. (2018) proposed a hierarchical attention model which captures important clues from the social context of a rumorous event at the post and sub-event levels. Jin et al. (2016) exploited the conflicting viewpoints in a credibility propagation network for verifying news stories propagated among the tweets. However, these approaches cannot embed features reflecting how the posts are propagated, and they require careful data segmentation to prepare the time sequence.
Some kernel-based methods were exploited to model the propagation structure. Wu et al. (2015) proposed a hybrid SVM classifier which combines an RBF kernel and a random-walk-based graph kernel to capture both flat and propagation patterns for detecting rumors on Sina Weibo. Ma et al. (2017) used a tree kernel to capture the similarity of propagation trees by counting their similar substructures in order to identify different types of rumors on Twitter. Ma et al. (2018) presented tree-structured recursive neural networks (RvNN) to jointly generate the representation of a propagation tree based on the post content and the propagation structure.
In recent years, the transformer (Vaswani et al., 2017) has demonstrated state-of-the-art performance in a variety of NLP tasks such as machine translation (Vaswani et al., 2017), sentence representation (Devlin et al., 2019), generative dialog (Tao et al., 2018), machine reading (Cheng et al., 2016), semantic labeling (Strubell et al., 2018), and rumor detection (Khoo et al., 2020). The transformer produces powerful representations by applying attention to each pair of elements in an input sequence, regardless of their distance. Khoo et al. (2020) propose a rumor verification model that allows direct modeling of dependencies between any two posts without regard to their responsive relation; it thus essentially treats the propagation as a fully connected graph instead of a tree. Our work is inspired by the idea of improving the representation power of the transformer to model structured objects such as syntactic parse trees. In these works, a straightforward strategy is to augment the conventional transformer with structural positional embeddings (Wang et al., 2019a; Shiv and Quirk, 2019). On the other hand, the Tree Transformer is proposed to attend over nearer neighbor nodes (Ahmed et al., 2019; Wang et al., 2019b). Our proposed method is a substantial extension of the Tree Transformer for modeling propagation tree structures to detect rumors on microblogging websites.

Problem Statement and Notations
On microblogging platforms such as Twitter, the follower/friend relationship embeds shared interests among the users. Once a user has posted a tweet, all of their followers will receive it. Twitter allows a user to retweet or comment on another user's post, so that the information can reach beyond the followers of the original creator. Therefore, we model the propagation of each claim as a tree structure $T(r) = \langle V, E \rangle$, where $r$ is the tree root representing the source tweet that states the claim, $V$ refers to a set of nodes each representing a responding post of $r$ in the thread of the circulation, and $E$ is a set of directed edges corresponding to the response relation among the nodes in $V$. Inspired by Ma et al. (2018), we consider two different propagation trees with distinct edge directions: (1) a Bottom-Up tree, where the responsive nodes point to the nodes they respond to, similar to a citation network; and (2) a Top-Down tree, where the edges follow the direction of information diffusion, obtained by reversing the Bottom-Up tree.
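The two edge directions can be illustrated with a minimal sketch. The function name `build_trees` and the `(child, parent)` pair format are assumptions made for this example and are not part of the released datasets; the node names follow Figure 1.

```python
from collections import defaultdict

def build_trees(replies):
    """Build both edge directions from (child, parent) reply pairs.

    The (child, parent) input format is an assumption for illustration.
    """
    bottom_up = {}                # child -> parent: a reply points to the post it responds to
    top_down = defaultdict(list)  # parent -> children: direction of information diffusion
    for child, parent in replies:
        bottom_up[child] = parent
        top_down[parent].append(child)
    return bottom_up, dict(top_down)

# Toy thread shaped like the example in Figure 1 (r is the source tweet).
replies = [("x1", "r"), ("x2", "r"), ("x11", "x1"), ("x12", "x1"), ("x21", "x2")]
bottom_up, top_down = build_trees(replies)
print(bottom_up["x21"])  # the post that x21 replies to
print(top_down["x1"])    # the direct replies to x1
```

Both maps describe the same set of edges; only the direction in which they are traversed differs.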
We formulate this task as a supervised classification problem, which learns a classifier $f$ from the labeled claims, that is, $f: C_i \rightarrow Y_i$, where $C_i$ is a claim and $Y_i$ takes one of the four categories: Non-rumor, True rumor, False rumor and Unverified rumor (NTFU), which were introduced in previous literature (Zubiaga et al., 2016b; Ma et al., 2017).

Tree Transformer Model for Rumor Detection
Rumor-indicative features can be captured from propagation structures: the stances expressed in responsive tweets can further reinforce the stance of the tweet being replied to (Ma et al., 2018; Kumar and Carley, 2019), posts with a strong stance along a tree branch are more important when determining rumor veracity (Li et al., 2019), and inaccurate information might be "self-checked" by comparison with correlative tweets. However, such relations are not fully exploited by previous work. Our core idea is to enhance the representation learning of rumor-indicative features by selectively attending over the corresponding tweets, which deeply explores user opinions and refines inaccurate information following the propagation tree structure.
Unlike the PLAN model, which crudely handcrafts 5 types of responsive relations as an additional consideration when attending over all the other tweets, our idea and the adopted mechanisms are significantly different. Figure 2 gives an overview of our transformer-based framework, based on the Bottom-Up tree and the Top-Down tree respectively, which is described in detail in the following subsections.

Figure 2: In (a), $T(\cdot)$ denotes a subtree rooted at the node in green, which is placed on the first line of the subtree. The edges in red and blue belong to the Bottom-Up and Top-Down trees, respectively.

Token-Level Tweet Representation
Given a tweet represented as a word sequence $x_i = (w_1 \cdots w_t \cdots w_{|x_i|})$, each $w_t \in \mathbb{R}^d$ is a d-dimensional vector which can be initialized with pre-trained word embeddings. We map each $w_t$ into a fixed-size hidden vector using Multi-Head Self-Attention networks (MH-SAN), which are the default setting in the Transformer encoder (Vaswani et al., 2017). The core idea of MH-SAN is to jointly attend to words from different representation subspaces at different positions. More specifically, MH-SAN first transforms the input word sequence $x_i$ into multiple subspaces with different linear projections:
$$Q_i^h,\ K_i^h,\ V_i^h = x_i W_Q^h,\ x_i W_K^h,\ x_i W_V^h \tag{1}$$
where $Q_i^h$, $K_i^h$ and $V_i^h$ are the query, key and value representations, and $W_Q^h$, $W_K^h$ and $W_V^h$ are the parameter matrices of the h-th head. The attention function is then applied to yield the output state of each head:
$$O_i^h = \mathrm{softmax}\left(\frac{Q_i^h {K_i^h}^{\top}}{\sqrt{d_h}}\right) V_i^h \tag{2}$$
Here $\sqrt{d_h}$ is the scaling factor and $d_h$ represents the dimensionality of the h-th head subspace. Finally, the output representation can be regarded as a concatenation of all the heads $O_i = [O_i^1, O_i^2, \cdots, O_i^n] \in \mathbb{R}^{|x_i| \times d}$, with $n$ as the number of heads, which is followed by a normalization layer (layerNorm) and a feed-forward network (FFN), consistent with the usage in Transformer:
$$B_i = \mathrm{layerNorm}(O_i W_B + O_i), \qquad H_i = \mathrm{FFN}(B_i W_h + B_i) \tag{3}$$
where $H_i = [h_1; \ldots; h_{|x_i|}] \in \mathbb{R}^{|x_i| \times d}$ is the matrix representing all words in tweet $x_i$, and $W_B$ and $W_h$ contain the weights of the transformation. Finally, we obtain the representation of $x_i$ by max-pooling the vectors of all involved words:
$$s_i = \text{max-pooling}(h_1, \ldots, h_{|x_i|}) \tag{4}$$
where $s_i \in \mathbb{R}^{1 \times d}$ is a d-dimensional vector, and $|\cdot|$ denotes the number of words.
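As a rough illustration of the token-level step, the sketch below applies scaled dot-product self-attention to the word vectors of one tweet and then max-pools the result into a single tweet vector. It is deliberately simplified: it uses a single head with identity projections and omits layerNorm and the FFN, none of which reflects the actual configuration of the model. The function names are chosen for this example only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head scaled dot-product self-attention.

    Simplifying assumption: queries, keys and values are all X itself
    (identity projections), so no parameters are learned here.
    """
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def max_pool(H):
    """Element-wise maximum over word vectors, giving one tweet vector."""
    return [max(col) for col in zip(*H)]

# Three toy 2-dimensional word vectors standing in for one tweet.
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
H = self_attention(tweet)  # contextualized word vectors
s = max_pool(H)            # tweet-level vector
print(s)
```

Each output row is a weighted average of the input word vectors, which is why every word representation can draw on the whole tweet regardless of position.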

Post-Level Tweet Representation
Previous literature has generally found that each node in the tree can trigger a set of responsive posts, i.e., a subtree, which contains distinct rumor-indicative patterns (Ma et al., 2017). Our goal is to cross-check all the posts in the same subtree to enhance the representation learning, because: (1) posts are generally short in nature, so the stance expressed in each node is closely correlated with the responsive context; and (2) posts in the same subtree address the individual opinion expressed in the root of the subtree. Coherent opinions can thus be captured by comparing all responsive posts in the same subtree, which down-weights incorrect information (e.g., the supportive posts towards a false claim).
To this end, we propose to utilize a transformer-based network to make pairwise comparisons within a subtree, which captures users' opinions and enhances the representation of each node. In this paper, we develop two structures, based on the Bottom-Up tree and the Top-Down tree respectively.
Bottom-Up Transformer. In the Bottom-Up tree, we visit the root of each subtree hierarchically, starting from the leaf nodes, until reaching the source tweet. We propose a Bottom-Up transformer to capture coherent attitudes towards each tree node by making pairwise comparisons among its responsive tweets.
Figure 2(c) illustrates the structure of our tree transformer, which cross-checks all the posts from the bottom subtrees to the upper subtrees. Specifically, given a subtree rooted at $x_j$, let $V(j) = \{x_j, \ldots, x_k\}$ denote the set of nodes in the subtree, i.e., $x_j$ and its direct response nodes. We then apply a post-level subtree attention (i.e., a transformer-based block as shown in Figure 2(b)) on $V(j)$ to get the refined representation of each node in $V(j)$:
$$[s'_j, \ldots, s'_k] = \mathrm{TRANS}([s_j, \ldots, s_k], \Theta_T) \tag{5}$$
where TRANS(·) is the transform function that has a form similar to Eq. 2-4, and $\Theta_T$ contains the transformer parameters. Thus $s'_*$ is the refined representation of $s_*$ obtained based on the context of the subtree. Note that each node can be treated as either parent or child in different subtrees, e.g., in Figure 2(a), $x_2$ can either be the parent node of $T(x_2)$ or a child node of $T(r)$. As a result, some of the nodes in our model are refined twice, hierarchically from the bottom subtree to the upper subtree, which: (1) captures stances by comparison with the parent node, and (2) down-weights inaccurate information by attending over neighbor nodes, e.g., a parent that supports a false claim might be refined if the majority of responses refute the parent node.
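The bottom-up visiting order of subtrees can be sketched as follows. This only illustrates the traversal; the subtree attention TRANS(·) itself is not implemented here, and the function name `subtrees_bottom_up` and the dictionary-based tree format are assumptions for this example.

```python
def subtrees_bottom_up(root, children):
    """List the subtrees V(j) = [x_j] + direct replies in bottom-up order.

    A post-order walk guarantees that every subtree rooted at a reply is
    emitted before the subtree rooted at its parent.
    """
    order = []

    def visit(node):
        kids = children.get(node, [])
        for kid in kids:
            visit(kid)
        if kids:  # leaf nodes do not root a subtree of their own
            order.append([node] + kids)

    visit(root)
    return order

# Same toy thread as in Figure 1: r is the source tweet.
children = {"r": ["x1", "x2"], "x1": ["x11", "x12"], "x2": ["x21"]}
for subtree in subtrees_bottom_up("r", children):
    print(subtree)
```

In this toy thread, `x2` appears in two subtrees, once as the root of its own subtree and once as a child of `r`, which corresponds to a node being refined twice. Under the same assumptions, processing the returned list in reverse would give a top-down order.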
Top-Down Transformer. This model is designed to leverage the structure of the Top-Down tree, as shown in Figure 2(d). Since the Top-Down tree models how the information flows from the source post to the current node, our model visits each subtree hierarchically from the source node down to the leaf nodes. The transformer mechanism shares a similar intuition with the Bottom-Up transformer; thus the node representation is enhanced by capturing stances and self-corrected context information.

The Overall Model
To jointly capture the opinions expressed in the whole tree, we utilize an attention layer to select important posts with accurate information, based on the refined node representations. This yields:
$$\alpha_i = \frac{\exp(s'_i \mu^{\top})}{\sum_j \exp(s'_j \mu^{\top})}, \qquad \tilde{s} = \sum_i \alpha_i s'_i \tag{6}$$
where $s'_i$ is the refined representation of node $x_i$ obtained from either the Bottom-Up Transformer or the Top-Down Transformer, and $\mu \in \mathbb{R}^{1 \times d}$ contains the weights of the transformation. Here $\alpha_i$ is the attention weight of node $x_i$, which is used to produce the representation $\tilde{s}$ of the entire tree. Lastly, we use a fully connected output layer to predict the probability distribution over the rumor classes:
$$\hat{y} = \mathrm{softmax}(V_o \tilde{s} + b_o) \tag{7}$$
where $V_o$ and $b_o$ are the weights and bias in the output layer. Furthermore, a straightforward way to obtain a richer tree representation is to concatenate the tree representation from the Bottom-Up transformer with that from the Top-Down transformer; the result is then fed into the above softmax(·) function to make rumor predictions.
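The tree-level attention and the output layer can be sketched in plain Python as below. The weights `mu`, `V_o` and `b_o` are toy values invented for this example rather than trained parameters, and the function names are chosen for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def tree_representation(S, mu):
    """Attention-weighted sum of node vectors S using the weight vector mu."""
    scores = [sum(a * b for a, b in zip(s, mu)) for s in S]
    alpha = softmax(scores)  # one attention weight per node
    pooled = [sum(a * s[j] for a, s in zip(alpha, S)) for j in range(len(mu))]
    return alpha, pooled

def predict(pooled, V_o, b_o):
    """Fully connected layer followed by softmax over the rumor classes."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(V_o, b_o)]
    return softmax(logits)

# Toy values: 3 nodes with 2-dimensional vectors, 4 classes (N, T, F, U).
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mu = [0.5, -0.5]
V_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
b_o = [0.0, 0.0, 0.0, 0.0]

alpha, tree_vec = tree_representation(S, mu)
probs = predict(tree_vec, V_o, b_o)
print(alpha, probs)
```

For the hybrid variant, one would under these assumptions concatenate the two pooled vectors from the Bottom-Up and Top-Down models before calling `predict`, with `V_o` widened accordingly.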
Model Training. All our models are trained to minimize the squared error between the probability distributions of the predictions and those of the ground truth:
$$L(y, \hat{y}) = \sum_{n=1}^{N} \sum_{c=1}^{C} (y_c - \hat{y}_c)^2 + \lambda \lVert \Theta \rVert_2^2 \tag{8}$$
where $y_c$ is the ground-truth label and $\hat{y}_c$ is the predicted probability of class $c$, $N$ is the number of trees for training, $C$ is the number of classes, $\lVert \cdot \rVert_2$ is the $L_2$ regularization term over all the model parameters $\Theta$, and $\lambda$ is the trade-off coefficient.
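A small worked example of this objective, assuming one-hot ground-truth vectors and a flat list of parameters. The numbers are invented for illustration only.

```python
def squared_error_loss(y_true, y_pred, params, lam):
    """Sum of squared errors over all trees and classes plus an L2 penalty."""
    data_term = sum(
        (y - p) ** 2
        for yt, yp in zip(y_true, y_pred)
        for y, p in zip(yt, yp)
    )
    reg_term = lam * sum(w * w for w in params)
    return data_term + reg_term

# One training tree whose true class is the third of four (one-hot encoded).
y_true = [[0.0, 0.0, 1.0, 0.0]]
y_pred = [[0.1, 0.1, 0.7, 0.1]]
params = [1.0, 2.0]  # stand-in for the model parameters
loss = squared_error_loss(y_true, y_pred, params, lam=0.1)
print(loss)
```

Here the data term is 3 × 0.01 + 0.09 = 0.12 and the penalty is 0.1 × (1 + 4) = 0.5, so the loss is approximately 0.62.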
During training, parameters are updated through back-propagation (Collobert et al., 2011) with AdaGrad (Duchi et al., 2011) to speed up convergence. The training process ends when the model converges or the maximum epoch number is reached. We represent input words using pre-trained GloVe Wikipedia 6B word embeddings (Pennington et al., 2014). We set the model dimension d to 300 and the dimension of the feed-forward network to 600. We use 1 and 6 layers of transformer encoder for the token-level and post-level representations respectively, and set the number of heads n to 12. The learning rate is initialized to 0.01, and the dropout rate is 0.2.

Datasets
For experimental evaluation, we use two publicly available tree datasets released by Ma et al. (2017) and Kochkina et al. (2018), namely TWITTER and PHEME. In each dataset, a group of source tweets, which form the tree roots, is provided together with their replies in the form of a tree structure. Each tree is annotated with one of the four class labels, i.e., non-rumor, true rumor, unverified rumor and false rumor.
To evaluate the robustness of our tree-structured detection methods, we consider two types of datasets: propagation trees with shallow depth and trees with large depth (i.e., complex responsive relations). Therefore, we regroup the trees in each of the datasets into two subsets according to the tree depth. Specifically, we split the TWITTER (PHEME) dataset into TWITTER-S (PHEME-S) and TWITTER-D (PHEME-D), composed of shallow trees and deep trees respectively. Table 1 displays the basic statistics of the four datasets.
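A depth-based regrouping of this kind could look like the sketch below. The depth threshold that separates shallow from deep trees is not stated in this section, so `threshold` is a hypothetical parameter, and depth is assumed here to be the number of edges on the longest path from the source tweet to a leaf.

```python
def tree_depth(root, children):
    """Number of edges on the longest path from the root to a leaf."""
    depth, frontier = 0, [root]
    while True:
        frontier = [c for n in frontier for c in children.get(n, [])]
        if not frontier:
            return depth
        depth += 1

def split_by_depth(trees, threshold):
    """Split (root, children) pairs into shallow and deep groups.

    `threshold` is a hypothetical cut-off chosen by the caller.
    """
    shallow, deep = [], []
    for root, children in trees:
        target = shallow if tree_depth(root, children) <= threshold else deep
        target.append((root, children))
    return shallow, deep

flat = ("r", {"r": ["a", "b", "c"]})  # every reply addresses the source
nested = ("r", {"r": ["x1", "x2"], "x1": ["x11", "x12"], "x2": ["x21"]})
shallow, deep = split_by_depth([flat, nested], threshold=1)
print(len(shallow), len(deep))
```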

Experimental Setup
For evaluation, we make comprehensive comparisons between our proposed models and the following state-of-the-art baselines on the rumor classification and early detection tasks.
- DT-Rank: A Decision-Tree-based Ranking model (Zhao et al., 2015) that identifies trending rumors by searching for inquiry phrases.
- DTC: An information credibility model based on a Decision-Tree Classifier (Castillo et al., 2011) with hand-crafted features that are derived from the overall statistics of the posts without temporal information.
- RFC: A Random Forest Classifier which uses three fitting parameters as temporal properties and a set of hand-crafted features based on user, linguistic and structural properties (Kwon et al., 2013).
- SVM-TK: An SVM classifier that uses a Tree Kernel (Ma et al., 2017) to capture the propagation structure via kernel learning.
- GRU-RNN: A rumor detection model based on recurrent neural networks (Ma et al., 2016) with GRU for learning rumor representations by modeling the sequential structure of relevant posts.
- BU-RvNN and TD-RvNN: The rumor detection models based on bottom-up and top-down RvNN models respectively (Ma et al., 2018) for integrating tweet contents and structure clues.
- PLAN: A rumor detection model based on transformer networks (Khoo et al., 2020) that models long-distance interactions between any pair of tweets and thus oversimplifies responsive relations.
- BU-TRANS, TD-TRANS and HD-TRANS: Our proposed tree transformer models in the Bottom-Up, Top-Down and Hybrid manner respectively (see Section 4).
We implement DT-Rank, DTC and RFC using Weka, SVM-TK using LibSVM, and all neural-network-based models with PyTorch. We use micro-averaged and macro-averaged F1 scores and the class-specific F-measure as evaluation metrics. We hold out 10% of each dataset for tuning the hyperparameters, and conduct 5-fold cross-validation on the rest. Table 2 shows the performance of all the compared methods on the shallow trees and deep trees from the TWITTER and PHEME datasets. The results indicate that our proposed methods outperform all the baselines, confirming the advantages of the tree transformer for the rumor detection task.
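For reference, the evaluation metrics can be computed as in the following sketch. The labels and predictions are invented toy values and the function name is an assumption. Note that for single-label classification, where every tree receives exactly one predicted class, the micro-averaged F1 coincides with accuracy.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (per-class F1, micro-averaged F1, macro-averaged F1)."""
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    return per_class, micro, macro

# Toy labels: N = non-rumor, T = true, F = false, U = unverified rumor.
y_true = ["N", "T", "F", "U", "N", "F"]
y_pred = ["N", "T", "F", "N", "N", "T"]
per_class, micro, macro = f1_scores(y_true, y_pred, ["N", "T", "F", "U"])
print(per_class, micro, macro)
```

In this toy case four of six predictions are correct, so the micro-averaged F1 equals the accuracy of 4/6, while the macro average is pulled down by the class U, which is never predicted correctly.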

Rumor Classification Performance
We observe that the three baselines in the first group, which are based on handcrafted features, perform poorly. RFC performs relatively better because it uses additional temporal traits. Among the baselines without feature engineering in the second group, the sequential neural model GRU-RNN, which does not consider structural information, performs slightly worse than SVM-TK, because SVM-TK uses an integrated kernel that exploits the propagation structure by comparing the trees based on both textual and structural similarities. The tree-structured neural models, i.e., BU-RvNN and TD-RvNN, make further improvements since they deeply bridge the content semantics and propagation clues. Among all the baselines, PLAN performs best since it leverages the representation power of the transformer by modeling dependencies between any two tweets, but this may under-utilize the structural information. In contrast, our proposed TRANS-based models (in the third group) not only inherently leverage the propagation structure but also take advantage of the representation power of the transformer, and thus beat PLAN on the four datasets. Among our three TRANS-based models, BU-TRANS and TD-TRANS perform comparably because both explore the tree structure using the Transformer. Combining them, as HD-TRANS does, makes further improvements, suggesting that the patterns learned by the two models are complementary.
Furthermore, when drilling down to the performance of our TRANS-based models on specific datasets, we find that model performance differs clearly between shallow trees and deep trees. Specifically, on the TWITTER-D and PHEME-D datasets, we observe that the tree-based baselines (e.g., BU-RvNN and TD-RvNN) perform comparably to PLAN, and the improvements of our models over PLAN range from 5.31% to 6.82% (7.34% to 9.40%) in accuracy (macro-F1 score) on TWITTER-D (PHEME-D). The reason is that PLAN was originally proposed and evaluated on shallow trees (Khoo et al., 2020), and may not generalize well to trees with deep and/or complex responsive relationships.
In comparison, on the TWITTER-S and PHEME-S datasets, PLAN outperforms TD-RvNN (i.e., the best tree-structured baseline) by a larger margin, and our TRANS-based models improve over PLAN by 1.31% to 3.27% (2.33% to 3.20%) in terms of accuracy (macro-F1 score) on TWITTER-S (PHEME-S), which is lower than the improvements obtained on TWITTER-D and PHEME-D. This is because the homogeneous edges in shallow trees (e.g., the majority of responsive nodes directly address the source post) offer only limited structural clues for rumor detection. This also verifies the hypothesis we made in Section 1 that tree-structured methods are more effective for deep trees.

Early Rumor Detection Performance
Debunking rumors at an early stage of their propagation is very important so that preventive measures can be taken in a timely manner. In the early rumor detection task, we compare different detection methods at a series of elapsed-time checkpoints. Figure 3 shows the performance of our HD-TRANS model versus PLAN (the best-performing baseline), TD-RvNN (the best tree-structured neural model), RFC (the best system based on feature engineering), and DT-Rank (an algorithm proposed for early rumor detection).
We observe that within the first few hours, the performance of our HD-TRANS model grows more quickly and starts to surpass the other models at the early stage of propagation. In particular, HD-TRANS achieves 75.0% (72.3%) accuracy on TWITTER-S (-D) and 65.9% (69.5%) macro-F1 score on PHEME-S (-D) within 12 hours. Although all the methods saturate as time goes by, HD-TRANS only needs around 14 (12) hours on TWITTER-S (-D) and about 15 (10) hours on PHEME-S (-D) to achieve performance comparable to that of the best baseline model (i.e., PLAN), indicating the superior early detection performance of our method, especially for more complex or deeper propagation patterns.
To get an intuitive understanding of what happens when we use the HD-TRANS model, we design an experiment to highlight the nodes with higher attention scores (i.e., $\alpha_i$ in Eq. 6) at the tree representation layer. Specifically, we sample two trees from the TWITTER dataset, i.e., a shallow tree and a deep tree, at the early stage of propagation, both of which have been correctly classified as false rumors by our HD-TRANS. In Figure 4, we observe that: 1) the nodes ranked highly by HD-TRANS attention scores (in yellow) exhibit obvious structured rumor-indicative patterns, e.g., a denial post sparks affirmative replies, as $x_{11} \rightarrow [x_{111}, x_{112}]$ shows in the deep tree; 2) the nodes attended by PLAN (in green) are generally independent of structure but take coherent stances or semantics; and 3) the results of HD-TRANS and PLAN differ significantly on the deep tree but are similar on the shallow tree, implying that more complex propagation patterns can be better captured by our proposed model.

Conclusions and Future Work
In this paper, based on the analysis that modeling the propagation structure is an essential factor for detecting rumors, we propose three variants of the transformer to further enhance representation learning for tree-structured modeling: a Bottom-Up transformer, a Top-Down transformer, and a Hybrid model. The results on four benchmark datasets confirm the advantages of our methods over state-of-the-art baselines, with particularly good generalization to trees with more complex responsive contexts. For future work, it is promising to include other types of edges/relationships besides the responsive relation to enhance rumor detection, such as friends/followers, quotation, mention, etc. We also plan to investigate the role of non-textual media such as images or videos in the effectiveness of rumor detection.