Meet The Truth: Leverage Objective Facts and Subjective Views for Interpretable Rumor Detection

Existing rumor detection strategies typically provide detection labels while ignoring their explanation. Nonetheless, providing pieces of evidence to explain why a suspicious tweet is a rumor is essential. To this end, a novel model, LOSIRD, is proposed in this paper. First, LOSIRD mines appropriate evidence sentences and classifies them by automatically checking the veracity of the relationship between the given claim and its evidence against about 5 million Wikipedia documents. LOSIRD then automatically constructs two heterogeneous graph objects to simulate the propagation layout of tweets and to encode the relationship between the claim and its evidence. Finally, a GraphSAGE processing component in LOSIRD provides both the label and the evidence. To the best of our knowledge, this is the first work to combine objective facts and subjective views to verify rumors. Experimental results on two real-world Twitter datasets showed that our model exhibited the best performance in the early rumor detection task, and its rumor detection performance outperformed baseline and state-of-the-art models. Moreover, we confirmed that both objective and subjective information are fundamental clues for rumor detection.


Introduction
With the prevalence of social media platforms, rumors have become a serious social problem. Notably, existing rumor detection methods roughly formulate this task as a natural language classification task: the goal is simply to label a given textual claim as rumor or non-rumor. Nevertheless, a mere verdict on a suspicious statement is insufficient for people to understand and reason about why a claim is a rumor. For example, Fig. 1 compares existing rumor detection methods with a rumor detection method that provides evidence. The claim in Fig. 1 is a half-truth, which is highly deceptive; for such rumors, providing only a label is unconvincing. Fig. 1(b) shows pieces of evidence retrieved from Wikipedia. From those evidence sentences, readers can easily judge whether the given claim is a half-truth and clearly understand why it is a rumor. Thus, we believe that a good rumor detection system should have two essential functions: a rumor-identifying function and an evidence-providing function.
Rumor detection that provides evidence has the following benefits: (1) improved detection performance; (2) improved user experience; (3) a basis for manual review; (4) improved accuracy of early rumor detection; (5) interception of the spread of similar rumors.
Despite these advantages, rumor detection that provides evidence is extremely hard. If no labeled evidence information is included in a rumor detection training dataset, a deep learning network is unlikely to generate such textual evidence content by itself. Unfortunately, the datasets currently used for rumor detection contain no such labeled evidence.

To find out what type of information can be used as evidence, two different kinds of information, subjective information and objective information, are discussed in this part (Merigo et al., 2016; Zorio-Grima and Merello, 2020). In the field of rumor detection, subjective information refers to source tweets, comments, etc., while objective information refers to information from Wikipedia, Baidu Encyclopedia, etc. Through our comprehensive analysis, we found that subjective and objective information show distinctly different characteristics, which are summarized in Table 1.

Table 1: Characteristics of subjective and objective information.
                Subjective info    Objective info
Availability    easy access        needs crawling
Quantity        extensive          rare
Coverage        one-sided          comprehensive
Consistency     conflicting        consistent
Purity          has noise          high purity

Objective information is consistent and of high purity and can thus be used as evidence, while subjective information also contains certain clues for debunking rumors.
To take advantage of both subjective and objective information, a novel model, LOSIRD, is proposed in this paper. This is notoriously challenging; the difficulties are threefold: (1) the model must have strong retrieval ability; (2) the model must have natural language inference (NLI) ability; (3) the model must be able to process topology information. Fig. 2 shows a high-level view of its architecture. The model is divided into two modules, i.e., ERM (Evidence Retrieval Module) and RDM (Rumor Detection Module). Inspired by the concept of transfer learning, a two-stage training approach was used for our LOSIRD model. In the first training phase, a widely used fact-checking dataset was utilized to train the ERM module. In the second training phase, two rumor detection datasets were used to train and evaluate the whole model.
The main contributions of this paper are fourfold:
1. This study, for the first time, arguably proposes a rumor detection model that provides evidence.
2. We are the first to propose two novel graph objects to simulate the propagation layout of tweets and to embed the relationship between the evidence and the claim in the rumor detection task.
3. Our LOSIRD achieved the highest detection accuracy and outperformed state-of-the-art models in the rumor detection task.
4. Our LOSIRD is more generalizable and robust in early rumor detection.
Related Work

Evidence Retrieval
The evidence retrieval task is highly correlated with the rumor detection task. One of the most widely used datasets for evidence retrieval is FEVER. Most researchers handle the FEVER shared task by following the FEVER organizers' pipeline approach, retrieving and verifying the evidence in three steps (Hanselowski et al., 2018; Malon, 2018). Zhou et al. (2019a) formulated claim verification as a graph reasoning task and provided two kinds of attention. Liu et al. (2020) presented KGAT, which combines edge kernels and node kernels to better embed and filter the evidence. Zhong et al. (2020) constructed two semantic-level topologies to enhance verification performance. Yoneda et al. (2018) employed a four-stage model for the FEVER shared task.

Rumor Detection
Existing deep-learning rumor detection methods can be divided into three categories: feature-driven, content-driven, and hybrid-driven methods. Feature-driven approaches, like machine learning methods, rely on a variety of characteristics to identify rumors. Rath et al. (2017) proposed a new concept of believability for automatic identification of users spreading rumors.
Content-driven approaches are methods based on natural language processing. Many researchers have adopted deep learning models to handle this task (Rath et al., 2017; Ma et al., 2016; Chen et al., 2018; Ma et al., 2018). Monti et al. (2019) proposed propagation-based fake news detection with a GCN. Nguyen (2019) detected rumors using a multi-modal social graph. Sujana et al. (2020) proposed a multi-loss hierarchical BiLSTM model for fake news detection.
Figure 2: The architecture of our LOSIRD model. The claim and the source tweet are essentially the same thing in this paper: "claim" means the model is trained on the FEVER dataset, while "source tweet" indicates the PHEME datasets are used.
Hybrid-driven approaches incorporate both feature engineering and text information representation to detect rumors (Liu and Wu, 2018). Ruchansky et al. (2017) proposed a model called CSI for rumor detection, which uses articles and extracts user characteristics to debunk rumors. Lu and Li (2020)

Comparison
The highlights of our model include providing evidence, incorporating two heterogeneous graph structures, and combining both evidence clues and reply information in detecting rumors. Our model exhibited stronger simulation ability, better scalability, and greater persuasiveness.

Problem Statement
We formulated this rumor detection task as a hybrid task that combines the evidence retrieval sub-task and rumor prediction sub-task.
The evidence retrieval sub-task was defined as follows: given a claim, the target of this sub-task was to match textual evidence from Wikipedia and classify the relationship between the potential evidence sentences and the given claim as "SUPPORTED", "REFUTED", or "NOT ENOUGH INFO (NEI)". We defined Wikipedia as an objective information corpus Wiki = {D_1, D_2, ..., D_{|W|}}, where D_i is a document from Wikipedia comprising several sentences describing one entity. The goal of this sub-task was to retrieve evidence and classify its relationship with the given claim C, i.e., f_ERM : C → (y^e, E), E ⊆ Wiki, where y^e is the predicted evidence label of the claim and E is the retrieved evidence set of the claim, containing several sentence-level pieces of evidence.
The rumor prediction sub-task was defined as follows: given a claim, its replies, its retrieved evidence set, and its evidence label, the model detects whether the claim is a rumor or non-rumor and provides the evidence. We defined the rumor dataset in this sub-task as Ψ = {(C_i, P_i, E_i, y^e_i)}, where C_i is the i-th source tweet in the rumor dataset, P_i is the source tweet's reply posts, and E_i and y^e_i are the corresponding retrieved evidence set and evidence label of C_i. Given a tweet T, the function of this task was defined as f_RDM : T → (y^r, E), E ⊆ Wiki, where y^r is the predicted rumor label.
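The two sub-task definitions above can be sketched as typed interfaces. This is our own illustrative scaffolding, not code from the paper: the container names (`ErmOutput`, `RdmOutput`) and the placeholder bodies are hypothetical; only the input/output shapes follow f_ERM and f_RDM as defined in the text.

```python
from dataclasses import dataclass
from typing import List

EVIDENCE_LABELS = {"SUPPORTED", "REFUTED", "NEI"}
RUMOR_LABELS = {"rumor", "non-rumor"}

@dataclass
class ErmOutput:
    evidence_label: str   # y^e: SUPPORTED / REFUTED / NEI
    evidence: List[str]   # E: sentence-level evidence drawn from Wikipedia

@dataclass
class RdmOutput:
    rumor_label: str      # y^r: rumor / non-rumor
    evidence: List[str]   # the evidence set E, surfaced to the reader

def f_erm(claim: str, wiki: List[List[str]]) -> ErmOutput:
    """f_ERM : C -> (y^e, E).  Placeholder body for illustration only."""
    return ErmOutput("NEI", [])

def f_rdm(claim: str, replies: List[str], erm_out: ErmOutput) -> RdmOutput:
    """f_RDM : T -> (y^r, E).  Placeholder body for illustration only."""
    return RdmOutput("non-rumor", erm_out.evidence)
```

A real implementation would replace the placeholder bodies with the ERM retrieval pipeline and the RDM graph model described in the following sections.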

ERM
Mainly following Liu et al. (2020) and Hanselowski et al. (2018), we adopted a three-step pipeline module for evidence retrieval, called ERM. The architecture of the ERM is shown in Fig. 3. It contains three main steps: document retrieval, which employs a keyword matching algorithm to crawl related files from Wikipedia; evidence retrieval, which extracts sentence-level evidence from the retrieved articles; and claim verification, which, based on the sentence-level evidence, predicts the relationship between the claim and the evidence as "Supported", "Refuted", or "NEI". Specifically, the ERM first leverages semantic NLP tool-kits to extract potential entities from the given claim. With the parsed entities, the top-k highest-ranked Wikipedia articles are filtered via the MediaWiki API. Then, from those retrieved documents, the ERM extracts objective facts, in the form of sentences relevant to the claim, as the predicted evidence. Finally, a verification component of ERM performs prediction over the given statement and the retrieved evidence, verifying their relationship as supporting, refuting, or NEI. Fig. 4 shows the structure of RDM. Since the source tweet forms a different topology with its replies than with its evidence, two heterogeneous graph objects, a conversation tree-shaped graph and an evidence star-shaped graph, were constructed in RDM.
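The document retrieval step can be illustrated with a toy keyword-matching retriever. This is a minimal sketch under our own assumptions: the real ERM parses entities with NLP tool-kits and queries the MediaWiki API, whereas here entity extraction is approximated by a crude capitalization/length heuristic and ranking by keyword overlap over an in-memory document list.

```python
import re
from collections import Counter
from typing import List, Tuple

def extract_keywords(claim: str) -> List[str]:
    # Crude stand-in for tool-kit entity extraction: keep capitalized
    # tokens and long words as candidate entities.
    tokens = re.findall(r"[A-Za-z]+", claim)
    return [t.lower() for t in tokens if t[0].isupper() or len(t) > 6]

def rank_documents(claim: str, docs: List[Tuple[str, str]], k: int = 3):
    """Return the top-k (title, text) documents ranked by keyword overlap
    with the claim; a toy substitute for the MediaWiki API lookup."""
    keywords = Counter(extract_keywords(claim))
    def score(doc):
        title, text = doc
        doc_tokens = set(re.findall(r"[a-z]+", (title + " " + text).lower()))
        return sum(c for w, c in keywords.items() if w in doc_tokens)
    return sorted(docs, key=score, reverse=True)[:k]
```

For example, ranking `[("Paris", "Paris is the capital city of France."), ("Germany", "Germany is a country in Europe.")]` against the claim "Paris is the capital of Germany" surfaces the "Paris" article first, since it matches more claim keywords.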

Two heterogeneous graphs
Conversation tree-shaped structure is a peculiar reply-relationship topology that forms naturally on social media and carries vital clues for rumor detection (Belkaroui et al., 2014; Pace et al., 2016). Of note, the conversation structure is tree-shaped: the root of the tree is the source tweet, each node represents a comment, and nodes are connected by their reply relationships.
Evidence star-shaped structure arises because each piece of evidence is a supplementary description of the source tweet; hence each evidence sentence relates directly to the source tweet, forming a star topology. In this star-shaped structure, the source tweet node sits at the center and all the evidence nodes surround it, each forming a point of the star.
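The two topologies above reduce to two simple edge lists. The following sketch uses our own indexing convention (node 0 for the source tweet; the paper does not prescribe one): the tree graph links each reply to the post it answers, while the star graph links every evidence node directly to the source.

```python
from typing import Dict, List, Tuple

def conversation_edges(replies_to: Dict[int, int]) -> List[Tuple[int, int]]:
    """Tree-shaped conversation graph: node 0 is the source tweet; each
    reply j is connected to the post it replies to (replies_to[j])."""
    return [(parent, child) for child, parent in sorted(replies_to.items())]

def evidence_edges(num_evidence: int) -> List[Tuple[int, int]]:
    """Star-shaped evidence graph: node 0 is the source tweet; evidence
    nodes 1..num_evidence all attach directly to it."""
    return [(0, k) for k in range(1, num_evidence + 1)]
```

For instance, `conversation_edges({1: 0, 2: 0, 3: 1})` yields `[(0, 1), (0, 2), (1, 3)]` (a tree where post 3 replies to post 1), and `evidence_edges(3)` yields `[(0, 1), (0, 2), (0, 3)]` (a three-pointed star).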

Rumor Detection Module
The rumor detection module contains four components: (1) a word vector encoding component, (2) a graph construction component, (3) a graph processing component, and (4) an output component.
In RDM, a deep BiLSTM was utilized to extract the information among words and generate sentence representations. Each obtained sentence vector was passed into a graph processing component whose backbone is the GraphSAGE model (Hamilton et al., 2017), which effectively handles variable-sized graphs. Since the output of the previous component is a set of sentence vectors containing no structural information, before passing this information into the graph processing component, two graph objects, the conversation tree-shaped object and the evidence star-shaped object, were constructed respectively.
The creation of the conversation graph object: G_p = (V_p, E_p), V_p = {c, p_1, ..., p_n}, where G_p is the i-th event's conversation graph object with vertex set V_p and edge set E_p. The vertex set includes all the posts in the event, and the edge set E_p encodes the reply relationship between posts. c and p_j are the tweet embedding results from the BiLSTM component; we selected the last hidden state of the BiLSTM as the sentence embedding. The creation of the evidence graph object: G_e = (V_e, E_e), V_e = {c, e_1, ..., e_m}, where G_e is an evidence graph object consisting of a vertex set V_e and an edge set E_e. The vertex set V_e includes the source post c and the evidence sentences, while the edge set E_e represents the relationship between the evidence and the source post; e_k is the evidence sentence embedding from the BiLSTM component. At the beginning of the forward propagation step, each node's feature was assigned as its initial hidden state, h^0_{p_j} = p_j and h^0_{e_k} = e_k, where h^0_{p_j} and h^0_{e_k} are the initial hidden states of the nodes of the conversation graph object and the evidence graph object in GraphSAGE.
The node hidden states in GraphSAGE update by repeatedly aggregating the immediate neighbors' hidden states and combining them with the node's own state to generate its new hidden state; this process makes the nodes gain incrementally richer information (Hamilton et al., 2017):

h^k_{N(v)} = AGG^pool_k({h^{k-1}_u, ∀u ∈ N(v)}),
h^k_v = σ(W^k · CON(h^{k-1}_v, h^k_{N(v)})),

where h^k_{N(v)} is the aggregated neighborhood vector, k is the depth of the information transmission updates (the number of times the graph information is updated), N is the neighborhood function, N(v) is the set of node v's immediate neighbors, AGG^pool_k is the aggregation function, and CON is the concatenation function.
Three aggregators are provided in GraphSAGE; in this article we chose the max-pooling aggregator:

AGG^pool_k = max({σ(W_pool h^{k-1}_u + b), ∀u ∈ N(v)}),

where max is the element-wise max operator and σ is a nonlinear activation function.
After k iterations of information transmission based on the conversation structure and the star structure, the final conversation and evidence representations were obtained as p and e, the reply and evidence representations of the i-th event, with the max aggregator compressing the node information to a fixed size. Thereafter, these two representations were concatenated and passed into a multilayer perceptron for the final prediction:

ŷ = softmax(V · CON(p, e) + b_y),

where V and b_y are parameters in the output layer.
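The forward pass described above can be sketched in NumPy. This is a toy illustration under our own assumptions: tiny random weights, hidden size 8 instead of the model's 128, ReLU as the nonlinearity σ, and a single star-shaped graph instead of the paired conversation/evidence graphs; only the update and readout structure follows the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (the paper uses 128)

def max_pool_aggregate(h, neighbors, W_pool, b_pool):
    """AGG^pool: element-wise max over sigma(W_pool h_u + b) of neighbors u."""
    stacked = np.stack([np.maximum(0.0, W_pool @ h[u] + b_pool) for u in neighbors])
    return stacked.max(axis=0)

def sage_layer(h, adj, W, W_pool, b_pool):
    """One GraphSAGE update: concatenate each node's state with its
    aggregated neighborhood, project with W, and apply a nonlinearity."""
    new_h = np.zeros_like(h)
    for v, neighbors in adj.items():
        agg = max_pool_aggregate(h, neighbors, W_pool, b_pool)
        new_h[v] = np.maximum(0.0, W @ np.concatenate([h[v], agg]))
    return new_h

# Star-shaped toy graph: source tweet 0 with two evidence nodes.
adj = {0: [1, 2], 1: [0], 2: [0]}
h = rng.normal(size=(3, d))
W = rng.normal(size=(d, 2 * d))
W_pool, b_pool = rng.normal(size=(d, d)), np.zeros(d)
for _ in range(2):           # k = 2 rounds of neighborhood aggregation
    h = sage_layer(h, adj, W, W_pool, b_pool)

# Readout: max-aggregate node states, then a softmax output layer.
graph_emb = h.max(axis=0)
logits = rng.normal(size=(2, d)) @ graph_emb
probs = np.exp(logits - logits.max()); probs /= probs.sum()
```

In the full model, two such graphs are processed and their readouts p and e are concatenated before the output layer; here a single readout stands in for CON(p, e).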

Datasets
The FEVER dataset was used to train the evidence retrieval module; its statistics are shown in Table 2. Two widely used rumor datasets, PHEME 2017 and PHEME 2018, were used to train and evaluate the whole proposed model, as shown in Table 3.

Experimental Setup
To evaluate the rumor detection performance of our model, we compared our proposed model with other popular rumor detection models, including some current state-of-the-art models. In the text processing stage, we cleaned the text by removing useless expressions and symbols, unifying case, etc. We used 200-dimensional GloVe word embeddings pre-trained on the Twitter 27B corpus and set the maximum vocabulary size to 80,000.
For the rumor detection module, the hidden size of the BiLSTM is 128 and the number of layers is 2. The batch size of GraphSAGE is 64. We used Adam with a 0.0015 learning rate to optimize the model, with the dropout rate set to 0.5. For evidence retrieval, we set the ESIM learning rate to 0.002, the dropout rate to 0, the batch size to 64, and the activation function to ReLU. For claim verification, we set the ESIM learning rate to 0.002, the dropout rate to 0.1, the batch size to 128, and the activation function to ReLU. We split the datasets by reserving 10% of the events as the validation set and dividing the rest in a ratio of 3:1 into training and testing partitions.
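The hyperparameters and the split above can be collected in one place. The config dict simply restates the values given in the text; the split helper and its rounding behavior are our own sketch of "10% validation, remaining 3:1 train:test" (the paper does not specify rounding), and the event count in the example is illustrative, not a PHEME statistic.

```python
# Hyperparameters as stated in the experimental setup.
CONFIG = {
    "bilstm_hidden": 128, "bilstm_layers": 2,
    "graphsage_batch": 64, "lr": 0.0015, "dropout": 0.5,
    "esim_evidence_retrieval": {"lr": 0.002, "dropout": 0.0, "batch": 64, "act": "relu"},
    "esim_claim_verification": {"lr": 0.002, "dropout": 0.1, "batch": 128, "act": "relu"},
}

def split_sizes(n_events: int):
    """10% validation; the remainder split 3:1 into train and test."""
    n_val = round(0.10 * n_events)
    rest = n_events - n_val
    n_train = round(rest * 3 / 4)
    return n_train, n_val, rest - n_train
```

For 1,000 events this yields 675 training, 100 validation, and 225 test events.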
• CSI: a state-of-the-art model detecting rumors by scoring users based on their behavior (Ruchansky et al., 2017).
• DEFEND: a state-of-the-art model that learns the correlation between a source article's sentences and user profiles (Shu et al., 2019).
• RDM: a state-of-the-art model integrating GRU and reinforcement learning to detect rumors at an early stage (Zhou et al., 2019b).
• CSRD: a state-of-the-art model that detects rumors by modeling conversation structure (Li et al., 2020b).
• LOSIRD: our model, which leverages objective facts and subjective views for interpretable rumor detection.

Experimental Results
The main experimental results are shown in Table 4. LOSIRD outperformed the best competing methods on PHEME 2017 and PHEME 2018, with an accuracy of 91.4% on PHEME 2017 and 92.5% on PHEME 2018. Moreover, precision, recall, and F1 were all higher than 90% on both datasets. Such promising results confirm the effectiveness of evidence information and the topology-message processing method in rumor detection. The CNN, BiLSTM, DEFEND, and RDM models typically concatenate posts into a single line based on publication time, ignoring the conversation structure information; nonetheless, this structure is crucial for encoding the posts into comprehensive and precise representations. The CSI and CRNN models process topology information, but they adopt only subjective information, causing insufficiency in information extraction.

Evidence Impact Study
In this section, we discuss whether the evidence facilitates rumor detection and determine the extent of its impact in debunking rumors. Notably, the evaluated datasets were PHEME 2017 and PHEME 2018.

Distribution of Retrieved Evidence
To accurately evaluate the retrieved evidence, its distribution over evidence labels was analyzed, and two pie charts were constructed to reflect the distributions. As shown in Fig. 5, most of the retrieved evidence was irrelevant to the given claim, while about 14.8% of the retrieved evidence sentences had sufficient information to support or refute the given claim. Although the proportion of supporting and refuting evidence was not large, this result was commendable and exceeded our expectations.

Retrieved Evidence Probability Analysis
We further evaluated the impact of evidence by statistically comparing the probability of rumor in the original data against the probability of rumor in the data labeled as refuted. The outcome is shown in Table 5. The probability of rumor in the original data was about 35% on both datasets, while the probability of rumor in the refuted-labeled data was around 73%, much higher than in the original data. Specifically, the rumor probability in refuted-labeled data increased by 42.5 percentage points on PHEME 2017 and by 34.3 on PHEME 2018. This strongly confirms that the retrieved evidence is a vital clue for rumor detection.

Influence Analysis of the Evidence on Deep Learning Model
To further illustrate the influence of evidence on rumor detection and analyze its impact on deep learning models, three NLP models, CNN, BiLSTM, and BERT, were deliberately selected as examination models in this subsection. We concatenated the suspicious claim with its evidence sentences and fed them into each of the three models. The experimental results are shown in Fig. 6. The horizontal axis represents the number of evidence sentences: 0 means the source tweet only, while 1 to 5 means the source tweet plus 1 to 5 evidence sentences. This paper also analyzed the performance before and after the evidence was filtered, represented in each chart by two lines: one for the unscreened evidence and the other for the screened evidence (with NEI evidence filtered out). The unscreened lines in all charts show a downward trend, indicating that the NEI evidence contains a certain amount of useless information, thereby making detection harder. Furthermore, after dropping the NEI evidence, all models improved, by an increase of 5% accuracy on average. This demonstrates that the filtered evidence significantly helps deep learning models in debunking rumors.
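The input construction for this probe can be sketched as follows. This is our own minimal reading of the procedure: concatenate the claim with up to five evidence sentences, optionally screening out NEI-labeled ones; the `(label, sentence)` tuple format and function name are assumptions, though the labels follow the paper's scheme.

```python
from typing import List, Tuple

def build_input(claim: str, evidence: List[Tuple[str, str]],
                max_evidence: int = 5, screen: bool = True) -> str:
    """Concatenate a claim with up to max_evidence evidence sentences.

    `evidence` holds (label, sentence) pairs with labels in
    {SUPPORTED, REFUTED, NEI}; with screen=True, NEI sentences are
    dropped before concatenation (the "screened" setting in Fig. 6).
    """
    kept = [s for lab, s in evidence if not (screen and lab == "NEI")]
    return " ".join([claim] + kept[:max_evidence])
```

For example, with evidence `[("NEI", "a"), ("SUPPORTED", "b")]`, screening yields `"claim b"`, while the unscreened setting yields `"claim a b"`.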

Early Detection Performance
To evaluate the early rumor detection performance of our model, 9 test sets reflecting real-world scenarios of rumors spreading on Twitter were created. Each test set included a different number of replies, ranging from 5 to 45, with the subsets sampled based on publication timestamps. As shown in Fig. 7, even with only 5 posts, our LOSIRD model achieved more than 91% accuracy on both the PHEME 2017 and PHEME 2018 datasets. Additionally, the broken-line diagram shows that the curve of our model was notably stable, indicating satisfactory robustness and high performance in early rumor detection. Because our model effectively makes use of objective information from Wikipedia, it does not rely on subjective information from user replies, thereby achieving satisfactory performance in the early stage of rumor propagation.
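The construction of these timestamp-based subsets can be sketched as follows. The helper name and the `(timestamp, text)` tuple format are our own assumptions; the logic simply implements "first n replies by publication time" for n = 5, 10, ..., 45 as described above.

```python
from typing import Dict, List, Tuple

def early_subsets(replies: List[Tuple[float, str]],
                  sizes=range(5, 50, 5)) -> Dict[int, List[str]]:
    """For each size n, keep only the first n replies ordered by their
    publication timestamp, simulating early-stage propagation."""
    ordered = [text for _, text in sorted(replies, key=lambda r: r[0])]
    return {n: ordered[:n] for n in sizes}
```

Each resulting subset is then fed to the model in place of the full reply set, so accuracy can be plotted against the number of available replies as in Fig. 7.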

Conclusion
In this paper, we proposed LOSIRD, a novel interpretable model for rumor detection. Notably, LOSIRD's rumor-debunking mechanism depends on both objective facts and subjective views. Objective fact sentences retrieved from 5,416,537 Wikipedia articles were fully utilized to help LOSIRD analyze the veracity of a suspicious claim. Meanwhile, the information in subjective views was extracted by simulating their propagation based on the conversation structure. Results on two public Twitter datasets showed that our model improved rumor detection performance by a clear margin over state-of-the-art baselines. Further, we analyzed the impact of objective facts on rumor detection and the effectiveness of the conversation structure. The experiments revealed that both objective facts and subjective views are vital clues for debunking rumors. Moreover, we believe that our model can be applied to rumor detection and other text classification tasks on social media.