Compare to The Knowledge: Graph Neural Fake News Detection with External Knowledge

Nowadays, fake news detection, which aims to verify whether a news document is trusted or fake, has become urgent and important. Most existing methods rely heavily on linguistic and semantic features from the news content, and fail to effectively exploit external knowledge which could help determine whether the news document is trusted. In this paper, we propose a novel end-to-end graph neural model called CompareNet, which compares the news to the knowledge base (KB) through entities for fake news detection. Considering that fake news detection is correlated with topics, we also incorporate topics to enrich the news representation. Specifically, we first construct a directed heterogeneous document graph for each news incorporating topics and entities. Based on the graph, we develop a heterogeneous graph attention network for learning the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news content. The contextual entity representations are then compared to the corresponding KB-based entity representations through a carefully designed entity comparison network, to capture the consistency between the news content and KB. Finally, the topic-enriched news representation combining the entity comparison features is fed into a fake news classifier. Experimental results on two benchmark datasets demonstrate that CompareNet significantly outperforms state-of-the-art methods.


Introduction
With the rapid development of the Internet, there are increasingly huge opportunities for fake news production, dissemination and consumption.* Fake news are news documents that are intentionally and verifiably false and could mislead readers (Allcott and Gentzkow, 2017). Fake news can easily misguide public opinion, cause crises of confidence, and disturb the social order (Vosoughi et al., 2018). It is well known that fake news exerted an influence on the 2016 US presidential elections (Allcott and Gentzkow, 2017). Thus, it is very important to develop effective methods for early fake news detection based on the textual content of the news document.

* The work was done while visiting Microsoft Research Asia.
Some existing fake news detection methods rely heavily on various hand-crafted linguistic and semantic features for differentiating between news documents (Conroy et al., 2015; Rubin et al., 2016; Rashkin et al., 2017; Khurana and Intelligentie, 2017; Shu et al., 2020). To avoid feature engineering, deep neural models such as Bi-LSTM and convolutional neural networks (CNN) have been employed (Oshikawa et al., 2020; Wang, 2017; Rodríguez and Iglesias, 2019). However, they fail to consider the sentence interactions within a document. Vaibhav et al. showed that trusted news and fake news exhibit different patterns of sentence interactions (Vaibhav et al., 2019). They modeled a news document as a fully connected sentence graph and proposed a graph attention model for fake news detection. Although these existing approaches can be effective, they fail to fully exploit external KBs, which could help determine whether the news is fake or trusted.
An external KB such as Wikipedia contains a large amount of high-quality structured subject-predicate-object triplets and unstructured entity descriptions, which can serve as evidence for detecting fake news. As shown in Figure 4, the news document claiming that "mammograms are not effective at detecting breast tumors" is likely to be detected as fake news given the knowledge that "the goal of mammography is the early detection of breast cancer" from the Wikipedia entity description page (https://en.wikipedia.org/wiki/Mammography). Pan et al. proposed to construct knowledge graphs from positive and negative news and apply TransE to learn triplet scores for fake news detection (Pan et al., 2018). Nevertheless, the performance is largely influenced by the construction of the knowledge graph. In this paper, to take full advantage of external knowledge, we propose a novel end-to-end graph neural model, CompareNet, which directly compares the news to the KB through entities for fake news detection. In CompareNet, we also use topics to enrich the news document representation, since fake news detection and topics are highly correlated (Zhang et al., 2020; Jin et al., 2016). For example, news documents in the "health" topic are inclined to be false, while documents in the "economy" topic tend to be trusted.
Particularly, we first construct a directed heterogeneous document graph for each news document, containing sentences, topics and entities as nodes. The sentences are fully connected with each other in both directions. Each sentence is also bidirectionally connected with its top relevant topics. If a sentence contains an entity, a directed link is built from the sentence to the entity. The reason for building one-way links from sentences to entities is to ensure that we can learn contextual entity representations that encode the semantics of the news, while avoiding the influence of the true entity knowledge on the news representation. Based on the directed heterogeneous document graph, we develop a heterogeneous graph attention network to learn topic-enriched news representations and contextual entity representations. The learned contextual entity representations are then compared to the corresponding KB-based entity representations with a carefully designed entity comparison network, in order to capture the semantic consistency between the news content and the external KB. Finally, the topic-enriched news representations and the entity comparison features are combined for fake news classification. To facilitate related research, we release both our code and dataset to the public (https://github.com/ytc272098215/FakeNewsDetection).

In summary, our main contributions include:

1) We propose a novel end-to-end graph neural model, CompareNet, which compares the news to external knowledge through entities for fake news detection.
2) In CompareNet, we also exploit useful topic information. We construct a directed heterogeneous document graph incorporating topics and entities, and develop a heterogeneous graph attention network to learn topic-enriched news representations. A novel entity comparison network is designed to compare the news to the KB.
3) Extensive experiments on two benchmark datasets demonstrate that our model significantly outperforms state-of-the-art models on fake news detection by effectively incorporating external knowledge and topic information.

Related Work
Fake news detection has attracted much attention in recent years (Zhou and Zafarani, 2020; Oshikawa et al., 2020). Many works also focus on the related problem of fact checking, which aims to search for evidence from external knowledge to verify the veracity of a claim (e.g., a subject-predicate-object triple) (Thorne et al., 2018; Zhong et al., 2020). Generally, fake news detection focuses on news events, while fact checking is broader (Oshikawa et al., 2020). Approaches to fake news detection can be divided into two categories: social-based and content-based.

Social-based Fake News Detection
The social context related to news documents contains rich information, such as user profiles and social relationships, that can help detect fake news. Social-based models are basically either stance-based or propagation-based. Stance-based models utilize users' opinions to infer news veracity (Jin et al., 2016; Wu et al., 2019). Tacchini et al. constructed a bipartite network of users and posts with 'like' stance information, and proposed a semi-supervised probabilistic model to predict the likelihood of posts being hoaxes (Tacchini et al., 2017). Propagation-based approaches rest on the basic assumption that the credibility of a news event is highly related to the credibility of the relevant social media posts. Both homogeneous (Jin et al., 2016) and heterogeneous credibility networks (Gupta et al., 2012; Shu et al., 2019; Zhang et al., 2020) have been built to model the propagation process. For instance, Zhang et al. (2020) constructed a heterogeneous network of news articles, creators and news subjects, and proposed a deep diffusive network model that incorporates the network structure to simultaneously detect fake news articles, creators and subjects.

Content-based Fake News Detection
On the other hand, news content contains clues for differentiating fake from trusted news. Many existing works extract specific writing styles, such as lexical and syntactic features (Conroy et al., 2015; Rubin et al., 2016; Khurana and Intelligentie, 2017; Rashkin et al., 2017; Shu et al., 2020; Oshikawa et al., 2020) and sensational headlines (Potthast et al., 2018; Sitaula et al., 2019), for fake news classification. To avoid hand-crafted feature engineering, neural models have been proposed (Wang, 2017; Rodríguez and Iglesias, 2019). For example, deep neural networks such as Bi-LSTM and convolutional neural networks (CNN) have been applied to fake news detection (Rodríguez and Iglesias, 2019). However, these works fail to consider the different sentence interaction patterns between trusted and fake news documents. Vaibhav et al. proposed to model a document as a sentence graph capturing the sentence interactions and applied graph attention networks to learn the document representation (Vaibhav et al., 2019). Pan et al. proposed to construct knowledge graphs from positive and negative news and apply TransE to learn triplet scores for fake news detection (Pan et al., 2018). Nevertheless, that approach relies heavily on the quality of the constructed knowledge graphs. In this paper, we propose a novel graph neural model, CompareNet, which directly compares the news to external knowledge for fake news detection. Considering that the detection of fake news is correlated with topics, we also use topics to enrich the news representation for improving fake news detection.
Some works (Wang, 2017; Khattar et al., 2019) also consider incorporating multi-modal features such as images to improve fake news detection.

Our Proposed CompareNet
In this section, we detail our proposed fake news detection model CompareNet, which directly compares the news to external knowledge for fake news detection. As shown in Figure 2, we also consider topics for enriching news representation since fake news detection is highly correlated with topics (Zhang et al., 2020). Specifically, we first construct a directed heterogeneous document graph for each news document incorporating topics and entities as shown in Figure 1. The graph well captures the interactions among sentences, topics and entities. Based on the graph, we develop a heterogeneous graph attention network to learn the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news document. To fully leverage external KB, we take the entities as the bridge between the news document and the KB. We compare the contextual entity representations with the corresponding KB-based entity representations using a carefully designed entity comparison network. Finally, the obtained entity comparison features are combined with the topic-enriched news document representation for fake news detection.

Directed Heterogeneous Document Graph
For each news document d, we construct a directed heterogeneous document graph G = (V, E) incorporating topics and entities, as shown in Figure 1. There are three kinds of nodes in the graph: sentences S = {s_1, s_2, ..., s_m}, topics T = {t_1, t_2, ..., t_K} and entities E = {e_1, e_2, ..., e_n}, i.e., V = S ∪ T ∪ E. The set of edges E represents the relations among sentences, topics and entities. The details of the graph construction are as follows.
We first split the news document into a set of sentences. Sentences are bidirectionally connected with each other in the graph, capturing the interaction of each sentence with every other sentence. Since topic information is important for fake news detection (Zhang et al., 2020), we apply unsupervised LDA (Blei et al., 2003) (with the total topic number K set to 100) to mine the latent topics T from all the sentences of all the documents in our dataset. Specifically, each sentence is taken as a pseudo-document and assigned to the top P relevant topics with the largest probabilities. Thus, each sentence is also bidirectionally connected with its top P assigned topics, allowing the useful topic information to propagate among the sentences. Note that we can also handle newly arriving news documents by inferring their topics with the trained LDA model. We identify the entities E in the document d and map them to Wikipedia using the entity linking tool TAGME. If a sentence s contains an entity e, we build a one-way directed edge from s to e, so as to allow information propagation only from sentences to entities. In this way, we avoid integrating true entity knowledge directly into the news representation, which could mislead the detection of fake news.
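The edge-building rules above can be sketched as follows. This is a minimal illustration, not the released implementation: sentence splitting, LDA topic assignment and TAGME entity linking are stubbed out with toy inputs, and all node names are hypothetical.

```python
# Minimal sketch of the directed heterogeneous document graph.
# Sentence splitting, LDA topic assignment and TAGME entity linking
# are replaced by toy inputs here.

def build_document_graph(sentences, sent_topics, sent_entities):
    """sentences: list of sentence ids; sent_topics: {sentence: [topics]};
    sent_entities: {sentence: [entities]}. Returns a set of directed edges."""
    edges = set()
    # Sentences are fully connected in both directions.
    for s in sentences:
        for s2 in sentences:
            if s != s2:
                edges.add((s, s2))
    # Sentence <-> topic edges are bidirectional, so topic information
    # can propagate among sentences.
    for s, topics in sent_topics.items():
        for t in topics:
            edges.add((s, t))
            edges.add((t, s))
    # Sentence -> entity edges are one-way only, so true entity knowledge
    # does not flow back into the news representation.
    for s, ents in sent_entities.items():
        for e in ents:
            edges.add((s, e))
    return edges

edges = build_document_graph(
    ["s1", "s2"],
    {"s1": ["topic_health"], "s2": ["topic_health"]},
    {"s1": ["Mammography"]},
)
```

Note that the only asymmetric relation is sentence-to-entity, which is exactly what later lets the model learn contextual entity representations without leaking KB knowledge into the document embedding.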

Heterogeneous Graph Convolution
Based on the above directed heterogeneous document graph G, we develop a heterogeneous graph attention network for learning the news representation as well as the contextual entity representations. It considers not only the weights of different nodes with different types (Hu et al., 2019) but also the edge directions in the heterogeneous graph.
Formally, we have three node types T = {τ_1, τ_2, τ_3}: sentences S, topics T and entities E, each with a different feature space. We apply an LSTM to encode a sentence s = {w_1, ..., w_m} and obtain its feature vector x_s ∈ R^M. Each entity e ∈ E is initialized with the entity representation e_KB ∈ R^M learned from the external KB (see Subsection 3.3.1). Each topic t ∈ T is initialized with a one-hot vector x_t ∈ R^K.
Next, consider the graph G = (V, E), where V and E denote the sets of nodes and edges respectively. Let X ∈ R^{|V|×M} be the matrix of node features, whose rows x_v ∈ R^M are the feature vectors of the nodes v. A and D are the adjacency matrix and the degree matrix, respectively. The heterogeneous convolution layer updates the (l+1)-th layer node representations H^{(l+1)} by aggregating the features H_τ^{(l)} of the neighboring nodes of each type τ (initially, H^{(0)} = X):

    H^{(l+1)} = σ( Σ_{τ ∈ T} B_τ · H_τ^{(l)} · W_τ^{(l)} ),

where σ(·) denotes the activation function. Nodes of different types τ have different transformation matrices W_τ^{(l)}, which account for the different feature spaces and project them into an implicit common space. B_τ ∈ R^{|V|×|V_τ|} is the attention matrix, whose rows index all the nodes and whose columns index their neighboring nodes of type τ. Its element β_{vv'} in the v-th row and v'-th column is computed as

    β_{vv'} = softmax_{v'}( σ( ν^T · α_τ [h_v ; h_{v'}] ) ),

where ν is the attention vector and α_τ is the type-level attention weight. h_v and h_{v'} are the representations of the current node v and its neighboring node v', respectively. The softmax function is applied to normalize over the neighboring nodes of node v.
We calculate the type-level attention weight α_τ based on the current node embedding h_v and the type embedding h_τ = Σ_{v'} Ã_{vv'} h_{v'}, i.e., the weighted sum of the embeddings of the neighboring nodes v' of type τ, where Ã = D^{-1/2}(A + I)D^{-1/2} is the normalized adjacency matrix with added self-connections:

    α_τ = softmax_τ( σ( μ_τ^T [h_v ; h_τ] ) ),

where μ_τ is the attention vector for the type τ. The softmax function is applied to normalize over all the types. After L layers of heterogeneous graph convolution, we obtain the representations of all the nodes (including sentences and entities), which aggregate neighborhood semantics. We apply max pooling over the sentence-node representations H_s ∈ R^N to obtain the final topic-enriched news document embedding H_d ∈ R^N. The learned entity representations, which encode the contextual semantics of the document, are taken as the contextual entity representations e_c ∈ R^N.
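To make the two-level (type-level and node-level) attention concrete, the following numpy sketch implements one layer of this heterogeneous graph convolution. It is an illustrative simplification under stated assumptions, not the authors' implementation: all node features are assumed to already live in a common dimension d, the LeakyReLU slope is taken from the experimental setting, and parameter shapes are our own choices.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def hetero_gat_layer(H, A, node_type, W, mu, nu):
    """One heterogeneous graph attention layer (illustrative numpy sketch).

    H: (n, d) node features, already projected to a common space.
    A: (n, n) 0/1 adjacency; A[v, u] = 1 means a directed edge u -> v.
    node_type: length-n sequence of type ids (e.g. 0=sentence, 1=topic, 2=entity).
    W: dict type -> (d, d) type-specific transformation matrix.
    mu: dict type -> (2d,) type-level attention vector.
    nu: (2d,) node-level attention vector.
    """
    n, d = H.shape
    A_self = A + np.eye(n)                         # add self-connections
    deg = A_self.sum(axis=1)
    A_norm = A_self / np.sqrt(np.outer(deg, deg))  # D^-1/2 (A + I) D^-1/2

    H_next = np.zeros_like(H)
    for v in range(n):
        nbrs = np.flatnonzero(A_self[v] > 0)
        types = sorted({node_type[u] for u in nbrs})
        # Type-level attention alpha_tau over the neighbour types of v.
        scores = []
        for tau in types:
            mask = np.array([node_type[u] == tau for u in nbrs])
            h_tau = sum(A_norm[v, u] * H[u] for u in nbrs[mask])
            scores.append(leaky_relu(mu[tau] @ np.concatenate([H[v], h_tau])))
        alpha = dict(zip(types, softmax(np.array(scores))))
        # Node-level attention beta_{vv'} over all neighbours of v,
        # modulated by the type-level weight of each neighbour's type.
        b = np.array([
            leaky_relu(nu @ (alpha[node_type[u]]
                             * np.concatenate([H[v], H[u]])))
            for u in nbrs])
        beta = softmax(b)
        # Aggregate neighbours through their type-specific transformations.
        H_next[v] = leaky_relu(
            sum(beta[i] * (H[u] @ W[node_type[u]])
                for i, u in enumerate(nbrs)))
    return H_next
```

Because the attention weights β over each node's neighbourhood sum to one, identical inputs pass through unchanged (up to the activation), which is a quick sanity check on the normalization.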

Entity Comparison Network
In this subsection, we detail our entity comparison network, which compares the learned contextual entity embeddings e_c to the corresponding KB-based entity embeddings e_KB. We believe entity comparison features can improve fake news detection, based on the assumption that the e_c learned from a trusted news document aligns well with the corresponding e_KB, while the opposite holds for fake news.

KB-based Entity Representation
We first illustrate how to take full advantage of both structured subject-predicate-object triplets and unstructured textual entity descriptions in the KB (i.e., Wikipedia) to learn KB-based entity representations e KB .
Structural Embedding. A wide range of knowledge graph embedding methods can be applied to obtain structured entity embeddings. Due to its simplicity, we adopt TransE (Bordes et al., 2013) to learn entity representations e_s ∈ R^M from the triplets. Formally, given a triplet (h, r, t), TransE regards the relationship r as a translation vector from the head entity h to the tail entity t, namely h + r ≈ t.
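The translation idea behind TransE can be sketched in a few lines. This is a minimal illustration of the scoring and the standard margin-based training objective, not the training pipeline used in the paper; the toy vectors are hypothetical.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: smaller ||h + r - t|| means more plausible."""
    return float(np.linalg.norm(h + r - t))

def transe_margin_loss(pos, neg, margin=1.0):
    """Margin-based ranking loss that pushes corrupted (negative) triplets
    at least `margin` further from satisfaction than true triplets."""
    return max(0.0, margin + transe_score(*pos) - transe_score(*neg))

# A triplet that satisfies h + r = t exactly gets the best (zero) score.
h = np.array([1.0, 0.0])
r = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])
```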
Textual Embedding. For each entity, we take the first paragraph of the corresponding Wikipedia page as its text description. Then we apply LSTM (Hochreiter and Schmidhuber, 1997) to learn entity representations e d ∈ R M that encode the entity descriptions.
Gating Integration. Since both the structural triplets and the textual description provide valuable information about an entity, we integrate them into a joint representation. Given the structural embedding e_s and the textual embedding e_d, we adopt a learnable gating function to integrate the entity embeddings from the two sources:

    e_KB = g_e ⊙ e_s + (1 − g_e) ⊙ e_d,

where g_e ∈ R^M is a gating vector (w.r.t. the entity e) that trades off information from the two sources and whose elements lie in [0, 1], and ⊙ denotes element-wise multiplication. The gating vector g_e means that each dimension of e_s and e_d is summed with a different weight. To constrain each element to [0, 1], we compute the gate with the Sigmoid function:

    g_e = σ(g̃_e),

where g̃_e ∈ R^M is a real-valued vector learned during training. After fusing the two types of embeddings with the gating function, we obtain the final KB-based entity embedding e_KB ∈ R^M, which encodes both the structural information from the triplets and the textual information from the entity description in the KB.
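The gating step amounts to a dimension-wise convex combination of the two embeddings. A minimal sketch, with hypothetical toy vectors in place of learned embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(e_s, e_d, g_tilde):
    """e_KB = g ⊙ e_s + (1 - g) ⊙ e_d with g = sigmoid(g_tilde).
    g_tilde is the real-valued gate parameter learned during training."""
    g = sigmoid(g_tilde)             # each element squashed into (0, 1)
    return g * e_s + (1.0 - g) * e_d

e_s = np.array([1.0, 1.0])           # structural (TransE) embedding
e_d = np.array([0.0, 2.0])           # textual (description) embedding
e_kb = gated_fusion(e_s, e_d, np.zeros(2))   # neutral gate: plain average
```

A strongly positive gate dimension keeps the structural value, a strongly negative one keeps the textual value, so the model can mix the sources per dimension.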

Entity Comparison
We then perform entity-to-entity comparison between the news document and the KB, to capture the semantic consistency between the news content and the KB. For each entity e_i, we calculate a comparison vector a_i between its contextual representation e_c ∈ R^N and its corresponding KB-based embedding e_KB ∈ R^M:

    a_i = f_cmp( e_c , W_e · e_KB ),

where f_cmp(·) denotes the comparison function and W_e ∈ R^{N×M} is a transformation matrix. To measure both embedding closeness and relevance (Shen et al., 2018), we design the comparison function as

    f_cmp(x, y) = W_a [ x − y ; x ⊙ y ],

where W_a ∈ R^{N×2N} is a transformation matrix and ⊙ is the Hadamard (element-wise) product. The final output comparison feature vector C ∈ R^N is obtained by max pooling over the comparison vectors A = [a_1, a_2, ..., a_n] of all the entities E = {e_1, e_2, ..., e_n} in the news document.
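The comparison and pooling steps can be sketched as follows. This is a shape-level illustration only: the weight matrices here are placeholders rather than learned parameters, and the toy dimensions (N = M = 2) are assumptions.

```python
import numpy as np

def f_cmp(x, y, W_a):
    """Comparison function W_a [x - y ; x ⊙ y], capturing closeness
    (subtraction) and relevance (element-wise product)."""
    return W_a @ np.concatenate([x - y, x * y])

def comparison_feature(E_c, E_kb, W_e, W_a):
    """Project each KB embedding into R^N, compare it to the contextual
    embedding, then max-pool the per-entity vectors a_i into C ∈ R^N."""
    A = np.stack([f_cmp(e_c, W_e @ e_kb, W_a) for e_c, e_kb in zip(E_c, E_kb)])
    return A.max(axis=0)

# Toy dimensions N = M = 2; placeholder (not learned) weights.
W_e = np.eye(2)
W_a = np.ones((2, 4))
```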

Model Training
After obtaining the comparison vector C ∈ R^N and the final news document representation H_d ∈ R^N, we concatenate them and feed the result into a Softmax layer for fake news classification:

    ŷ = softmax( W_o [C ; H_d] + b_o ),

where W_o and b_o are the parameter matrix and bias vector of a linear transformation. During model training, we minimize the cross-entropy loss over the training data together with the L2-norm of the parameters:

    L = − Σ_{d ∈ D_train} Y_d · log(ŷ_d) + η ||Θ||_2,

where D_train is the set of news documents for training, Y is the corresponding label indicator matrix, Θ denotes the model parameters, and η is the regularization factor. For model optimization, we adopt the gradient descent algorithm.
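The classification head and training objective can be sketched as follows; this is a minimal numpy illustration of the forward pass and loss, not the training code, and the zero-initialized toy weights are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(C, H_d, W_o, b_o):
    """y_hat = softmax(W_o [C ; H_d] + b_o)."""
    return softmax(W_o @ np.concatenate([C, H_d]) + b_o)

def training_loss(y_hat, label, params, eta=1e-6):
    """Cross-entropy on the true class plus an L2 penalty on the parameters."""
    return -np.log(y_hat[label]) + eta * sum(np.sum(p ** 2) for p in params)
```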

Experiments
We conduct extensive experiments across various settings and datasets. Following the previous work (Vaibhav et al., 2019), we use SLN: Satirical and Legitimate News Database (Rubin et al., 2016), and LUN: Labeled Unreliable News Dataset (Rashkin et al., 2017) for our experiments. Table 1 shows the statistics. Our baseline models include deep neural models: LSTM (Hochreiter and Schmidhuber, 1997), CNN (Kim, 2014), BERT+LSTM (Vaibhav et al., 2019) (BERT for sentence encoder and then LSTM for document encoder) and BERT (Devlin et al., 2019) (directly for document encoder). We also compare our model with graph neural models: GCN and GAT based on an undirected fully-connected sentence graph, which use attention pooling or max pooling for learning news document representation. For fair comparison with the previous work (Vaibhav et al., 2019), we use LSTM to encode sentences with randomly initialized word embeddings, which is the same as all the graph neural baselines. We run our model 5 times and report the micro-averaged (Precision = Recall = F1) and macro-averaged scores (Precision, Recall, F1) in all the settings including 2-way and 4-way classification.
2-way classification: We use the satirical and trusted news articles from LUN-train for training, LUN-test for validation and evaluate our model on the entire SLN dataset. This is done to emulate a real-world scenario where we want to see the performance of our model on an out-of-domain dataset.
4-way classification: We split LUN-train into an 80:20 split to create our training and validation sets. We use LUN-test as our in-domain test set.
Experimental Setting. In our experiments, we set the number of topics in LDA as K = 100. Each sentence is assigned to the top P = 2 topics with the largest probabilities. The number of layers of our heterogeneous graph convolution is set to L = 1. These parameters are chosen according to the best results on the validation set. The other hyper-parameters are set the same as in the baseline (Vaibhav et al., 2019) for fair comparison. Specifically, all the hidden dimensions used in our model are set to M = 100, and the node embedding dimension to N = 32. For GCN, GAT and CompareNet, we use LeakyReLU with slope 0.2 as the activation function. For model training, we train the models for a maximum of 15 epochs and use the Adam optimizer with learning rate 0.001. We set the L2 normalization factor η to 1e-6.

Table 2 shows the results for the two-way classification between satirical and trusted news articles. We report only micro F1 since micro Precision = Recall = F1. As we can see, our proposed model CompareNet significantly outperforms all the state-of-the-art baselines in terms of all the metrics. Compared to the best baseline model, CompareNet improves both micro F1 and macro F1 by nearly 3%. We can also find that the graph neural network based models GCN and GAT all perform better than the deep neural models, including CNN, LSTM and BERT. The reason is that the deep neural models fail to consider the interactions between sentences, which are important for fake news detection since different interaction patterns are observed in trusted and fake news documents (Vaibhav et al., 2019). Our model CompareNet further improves fake news detection by effectively exploiting the topics as well as the external KB. The topics enrich the news representation, and the external KB offers evidence for fake news detection. We also present the results of the four-way classification in Table 3. Consistently, all graph neural models capturing sentence interactions outperform the deep neural models.
Our model CompareNet achieves the best performance in terms of all metrics. We believe that our model CompareNet benefits from the topics and external knowledge.

Ablation Study
In this subsection, we conduct experiments to study the effectiveness of each module in CompareNet and of the way we incorporate external knowledge. We report the average performance of 5 runs on the LUN-test set. As shown in Table 4, we test the performance of CompareNet when removing the structured triplets, removing the entire external knowledge, removing topics, and removing both topics and external knowledge. In the last two rows, we further test two variant models, CompareNet (undirected) and CompareNet (concatenation). The results indicate that the topic information is as important as the external knowledge. Removing both topics and external knowledge (i.e., w/o Both) leads to a substantial performance drop (4.0-5.0%), demonstrating the importance of both. The variant model CompareNet (undirected), although incorporating both topics and external knowledge, achieves lower performance than CompareNet w/o Entity Cmp and CompareNet w/o Topics. The reason could be that CompareNet (undirected) directly aggregates the true entity knowledge into the news representation during graph convolution without considering the edge directions, which misleads the classifier when differentiating fake news. This verifies the appropriateness of our directed heterogeneous document graph. The last variant, CompareNet (concatenation), also performs worse than CompareNet w/o Entity Cmp, further indicating that directly concatenating true entity knowledge is not a good way to incorporate entity knowledge. Its performance drops by around 2.0% compared to CompareNet. These results demonstrate the effectiveness of the carefully designed entity comparison network in CompareNet.

Figure 3 shows the performance (micro and macro F1) of CompareNet on the LUN validation set with different numbers of top assigned topics P per sentence. As we can see, micro F1 and macro F1 first consistently rise with the increase of P and then drop when P is larger than 2. This may be because connecting too many low-probability topics introduces noise.
Thus, in our experiments, we set P = 2.

Case Study
To further illustrate why our model outperforms the state-of-the-art baseline GAT+Attn (Vaibhav et al., 2019), we present case studies. As shown in Figure 4, the content of the news document is in conflict with the entity description from Wikipedia. Specifically, the news claiming that "FDA target and threaten the natural health community" delivers a meaning contrary to the entity description that "FDA is responsible for protecting and promoting public health". Similarly, the news document claiming that "mammograms are not effective at detecting breast tumors" conveys a meaning different from the entity description of "mammograms". We believe that our model CompareNet benefits from the comparison to Wikipedia knowledge through the entity comparison network. We also find unsuccessful cases, since an entity can be mistakenly linked to a wrong entity in Wikipedia.

Conclusion
In this paper, we propose a novel end-to-end graph neural model CompareNet which compares the news to the external knowledge for fake news detection. Considering that the detection of fake news is correlated with topics, in our model, we also use topics to enrich the news document representation for improving fake news detection. Particularly, we first construct a directed heterogeneous document graph for each news document capturing the interactions among sentences, topics and entities.
Based on the graph, we develop a heterogeneous graph attention network for learning the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news content. To capture the semantic consistency between the news content and the KB, the learned contextual entity representations are then compared to the KB-based entity representations with a carefully designed entity comparison network. Finally, the obtained entity comparison features are combined with the news representation for improved fake news classification. Experiments on two benchmark datasets have demonstrated the effectiveness of the way we incorporate the external knowledge and topics.
In future work, we will explore a better way to combine multi-modal data (e.g., images) and external knowledge for fake news detection.