KAPALM: Knowledge grAPh enhAnced Language Model for Fake News Detection



Introduction
In recent years, with the development of the Internet, social media platforms such as Facebook and Twitter have become the main channels for people to consume news due to their real-time nature and easy access. However, social media is a double-edged sword, and it exposes people to fake news. The wide spread of fake news can misguide public opinion, threaten people's health, and even cause devastating effects on society (Vosoughi et al., 2018). Thus, research on automatic fake news detection is desirable.
Prior research combats fake news on social media with traditional machine learning methods and hand-crafted features based on news content (Castillo et al., 2011; Kwon et al., 2013; Zubiaga et al., 2017; Przybyla, 2020). To avoid manual feature engineering, deep neural network models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) have been used to learn high-level feature representations automatically and achieve strong performance in detecting fake news (Ma et al., 2016; Wang, 2017; Shu et al., 2019). In the past few years, pre-trained language models (PLMs), such as BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), have become the mainstream approach; they provide contextualized representations of textual content in the natural language processing field and also achieve competitive performance for fake news detection (Pelrine et al., 2021; Sheng et al., 2021). However, these works fail to consider the knowledge entities in news articles. Although news content is condensed and full of knowledge entities by which people usually verify the veracity of news, PLMs are not effective in capturing the knowledge-level relationships between entities. As shown in the left of Figure 1, it is hard to judge the veracity of the news article without the entity knowledge about Dennis Bray and Hans von Storch.
Large Knowledge Graphs (KGs), such as Freebase (Bollacker et al., 2008) and Wikidata (Vrandecic and Krötzsch, 2014), contain a large number of structured triplets, which can serve as the entity knowledge base for detecting fake news. Existing studies have demonstrated the significant role of KGs (Pan et al., 2018; Dun et al., 2021; Jiang et al., 2022). Specifically, Dun et al. (2021) proposed knowledge attention networks to measure the importance of knowledge and incorporate both semantic-level and knowledge-level representations of news content. Jiang et al. (2022) proposed a knowledge prompt learning method which incorporated KGs into the prompt representation to make it more expressive for verbal word prediction. Hu et al. (2021) explored structural entity embeddings and compared contextual entity representations with the corresponding entity representations in the knowledge graph. However, these works focus on the textual representation of entities in KGs and have not explored the in-depth structured knowledge associated with the news article. As shown in Figure 1, we construct a subgraph of the KG for the news article, which we call the entity graph throughout this paper; this entity graph helps detect fake news with the knowledge that "Dennis Bray is a biologist and Hans von Storch is a climatologist and meteorologist".
In this paper, we propose a Knowledge grAPh enhAnced Language Model (KAPALM) which explores both coarse- and fine-grained knowledge in KGs for fake news detection. Firstly, we identify entities in news content and link them to a large-scale KG. An entity graph is constructed by extracting the first-order neighbors of the identified entities, and is fed into a Graph Neural Network (GNN) to obtain coarse-grained structured knowledge representations. Then, we prune the entity graph and feed it into the attentive graph pooling layer to obtain fine-grained knowledge representations. Finally, we integrate the coarse- and fine-grained knowledge representations with the representation of news content encoded by PLMs for fake news detection. The primary contributions of this paper can be summarized as follows: • We propose a method to construct an entity graph for each news article. Our method extracts coarse-grained structured knowledge from the entity graph and provides fine-grained knowledge via the pruned entity graph.
• We propose a Knowledge grAPh enhAnced Language Model (KAPALM) to integrate the coarse- and fine-grained knowledge representations with the representation of news content for fake news detection.
• We compare our model with the state-of-the-art methods on two benchmark datasets in few-shot and full-scale settings. The experimental results demonstrate that our model outperforms the state-of-the-art methods in the full-scale setting. Moreover, our model is competitive in the few-shot setting.
2 Related Work

Fake News Detection
In this section, we briefly introduce the related work on fake news detection. Zhou and Zafarani (2021) outline early methods which design handcrafted features and utilize statistical machine learning to detect the authenticity of a given piece of news. Several works (Castillo et al., 2011; Przybyla, 2020; Kwon et al., 2013; Zubiaga et al., 2017) proposed to utilize statistical information to leverage textual and social features to detect fake news. Also, some works (Ajao et al., 2019; Zhang et al., 2021) focus on capturing sentiment features to better detect fake news. With the development of deep learning techniques, models such as CNN, RNN, and the Transformer (Kim, 2014; Liu et al., 2016; Vaswani et al., 2017) have been used for fake news detection (Dun et al., 2021; Ma et al., 2016; Samadi et al., 2021; Shu et al., 2019; Jiang et al., 2022). Dun et al. (2021) proposed an approach which applies a knowledge attention network to incorporate external knowledge on top of the Transformer. Jiang et al. (2022) proposed KPL, which incorporated prompt learning on PLMs for the first time, enriched prompt template representations with entity knowledge, and achieved state-of-the-art performance on two benchmark datasets, PolitiFact (Shu et al., 2020) and Gossipcop (Shu et al., 2017).

Knowledge Graphs
External knowledge can provide necessary supplementary information for detecting fake news. It is stored in a knowledge graph format, which contains information about real-world entities such as people and places. Nowadays, knowledge graphs are used in various natural language processing applications, such as news recommendation (Wang et al., 2018), fact verification (Zhong et al., 2020; Zhou et al., 2019), and fake news detection (Dun et al., 2021; Jiang et al., 2022). Among them, Dun et al. (2021) designed two attention mechanisms, news towards entities and news towards entity contexts, to capture the different importance of entities in detecting fake news. Jiang et al. (2022) utilized a prompt template and integrated entities extracted from news articles into the prompt representation for fake news detection. Hu et al. (2021) proposed a novel graph neural model which compares the news to the knowledge graph through entities for fake news detection. The limitation of these methods is that they all focus on news entity information and ignore the structured information of KGs associated with the news article. Our proposed model can learn knowledge from large-scale PLMs, acquire entity and topological structure information from external KGs, and improve the performance of fake news detection.

Model
We illustrate the framework of KAPALM in Figure 2. The input of KAPALM consists of the news content and the subgraph constructed from KGs. The output of the model is the label for the binary classification of fake news detection. First, for each piece of news, we use a pre-trained language model with an adapter layer (Pfeiffer et al., 2020) to encode the text of the news content. Next, entity linking is used to identify entities in the news content, and we extract these entities and their corresponding first-order neighbors in the knowledge graph to construct the coarse-grained entity graph of the news. Entity extraction and entity graph construction are described in the knowledge extraction section. Then, we feed the constructed entity graph into a graph attention network (Velickovic et al., 2017) to generate a new entity graph that incorporates both the knowledge graph and entity information. We use the interaction node in the entity graph to represent coarse-grained entity knowledge. Furthermore, to obtain the fine-grained entity knowledge representation, we prune the constructed entity graph and then use an attention mechanism for graph pooling. Finally, we concatenate the representations of the news content and the coarse- and fine-grained entity knowledge, and feed them into a fully connected neural network to obtain the prediction for fake news detection.

Text Encoder
This module generates the representation of the news content. We utilize a pre-trained language model with Adapter (Pfeiffer et al., 2020) to encode the news content and capture the semantic information contained in the news article. A pre-trained language model is trained on a large textual corpus and is able to capture the semantics of the news content. However, when the pre-training task is not similar to the downstream task, fine-tuning the pre-trained language model is required to achieve state-of-the-art performance on the downstream task. In addition, the large number of parameters in a pre-trained language model leads to high time and space costs during full fine-tuning. Thus, adapter tuning is used as a lightweight fine-tuning strategy that achieves performance competitive with full fine-tuning on downstream tasks. Adapters add a small set of additional parameters at every layer of the Transformer (Vaswani et al., 2017). The parameters of the pre-trained language model are kept frozen, and only the parameters of the adapters are trained during fine-tuning. The adapter layer can be represented as the following equation:

Adapter(h) = W_up · f(W_down · h) + h,

where W_up and W_down are parameter matrices and f is a non-linear activation function.
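As a minimal sketch, the bottleneck adapter computation above can be written in a few lines of NumPy. The bottleneck size r and the choice of ReLU for the non-linearity f are illustrative assumptions, not details specified in the paper:

```python
import numpy as np

def adapter(h, W_down, W_up):
    """Bottleneck adapter: down-project, apply a non-linearity,
    up-project, and add the residual connection.
    Shapes: h (d,), W_down (r, d), W_up (d, r), with r << d."""
    z = np.maximum(W_down @ h, 0.0)  # f = ReLU (assumed)
    return h + W_up @ z              # residual connection

# Tiny demo with hypothetical sizes d=8, r=2
rng = np.random.default_rng(0)
d, r = 8, 2
h = rng.standard_normal(d)
out = adapter(h, rng.standard_normal((r, d)), rng.standard_normal((d, r)))
```

Because of the residual connection, an adapter whose projections are zero simply passes its input through, which is why adapters can be inserted into a frozen Transformer without disturbing its pre-trained behavior at initialization.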

Entity Graph Construction and Pruning
This module extracts relevant entities from the knowledge graph and constructs two subgraphs. Figure 3 illustrates the pipeline, which includes entity linking, entity graph construction, and entity graph pruning. Firstly, we use entity linking (Milne and Witten, 2008; Sil and Yates, 2013) to extract entity mentions from the news content, align them with entities in the knowledge graph, and obtain their first-order neighbors in the knowledge graph. As a result, we obtain the entity set E = {e_1, e_2, ..., e_n} and its first-order neighbor set N = ∪_{e_i ∈ E} N_{e_i}, where N_{e_i} is the first-order neighbor set of e_i. Secondly, we construct the entity graph G. In order to better aggregate entity graph information, we create an interaction node that is connected to every e_i ∈ E, and then connect each e_i to its corresponding first-order neighbors in N_{e_i}. Due to the large number of entities in the constructed entity graph G, the information associated with relevant entities in the news content cannot be effectively captured. Therefore, we prune the entity graph to retain only the neighbors that lie on a path between a pair of entities in E. In other words, we remove the first-order neighbors with a degree of one in the entity graph. In this way, we obtain the pruned entity graph G′, which consists of all entities in E and their remaining first-order neighbors N′ = ∪_{e_i ∈ E} N′_{e_i}, where N′_{e_i} is the set of remaining first-order neighbors of e_i.
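The construction and pruning steps above can be sketched in pure Python with an adjacency-set representation. The "INTERACTION" node label and the dictionary-based graph are illustrative choices of ours, not the paper's implementation:

```python
def build_entity_graph(entities, neighbors):
    """entities: linked entity ids from the news; neighbors: dict mapping
    each entity to its first-order KG neighbors. Returns an adjacency dict
    with an extra interaction node connected to every linked entity."""
    adj = {"INTERACTION": set(entities)}
    for e in entities:
        adj.setdefault(e, set()).add("INTERACTION")
        for n in neighbors.get(e, []):
            adj[e].add(n)
            adj.setdefault(n, set()).add(e)
    return adj

def prune(adj, entities):
    """Drop first-order neighbors of degree one, i.e. keep only the
    neighbors lying on a path between two linked entities."""
    keep = {n for n in adj
            if n == "INTERACTION" or n in entities or len(adj[n]) > 1}
    return {n: {m for m in adj[n] if m in keep} for n in keep}
```

For example, if entities A and B share the neighbor X while Y neighbors only A, pruning keeps X (it connects a pair of linked entities) and discards Y.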

Coarse-grained Knowledge
In this module, we aim to obtain coarse-grained knowledge representations for the entities in the entity graph. After constructing the entity graph, we use PLMs to initialize the node representations. While the representation of the interaction node is initialized with the embedding of the [CLS] token for the news content, the other entity nodes are initialized by encoding the text of the entity name. As shown at the bottom of Figure 2, the initialized entity graph is fed into a Graph Attention Network (Velickovic et al., 2017) to aggregate information among all entities. The representation of the interaction node thereby integrates entity knowledge with the contextual representation of news content. Because the entity graph includes many noisy entities, we use only the representation of the interaction node, which we name the coarse-grained knowledge representation. The hidden representation of coarse-grained knowledge a is calculated as follows:

a = Interaction(GAT(G)),

where G denotes the entity graph, GAT denotes the graph attention network, and the function Interaction returns the representation of the interaction node.
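A single simplified graph-attention layer (one head, no bias, following the general form of Velickovic et al., 2017) can be sketched in NumPy; the coarse-grained vector a is then the updated row of the interaction node. This is our own illustrative simplification, not the paper's exact architecture:

```python
import numpy as np

def gat_layer(H, adj, W, a_vec, leaky=0.2):
    """One simplified single-head graph-attention layer.
    H: (n, d) node features; adj: list of neighbor-index lists
    (self-loops included); W: (d, d') projection; a_vec: (2*d',)
    attention vector."""
    Z = H @ W
    out = np.zeros_like(Z)
    for i, nbrs in enumerate(adj):
        # attention logits over the neighborhood of node i
        e = np.array([np.concatenate([Z[i], Z[j]]) @ a_vec for j in nbrs])
        e = np.where(e > 0, e, leaky * e)          # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                       # softmax weights
        out[i] = alpha @ Z[nbrs]                   # weighted aggregation
    return out

# If the interaction node is index 0, the coarse-grained
# knowledge representation is a = gat_layer(...)[0].
```

With equal attention logits the layer degenerates to mean aggregation over the neighborhood, which makes the weighting easy to sanity-check.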

Fine-grained Knowledge
After pruning the entity graph, we feed the pruned graph into the attentive graph pooling layer to extract the fine-grained entity knowledge representation. Entities in the entity graph, especially the first-order neighbor entities, do not play the same role in detecting the veracity of news articles. Therefore, we propose to utilize an attentive graph pooling layer, based on the multi-head attention mechanism, to measure the importance of different entities for fake news detection. The output of the attentive graph pooling layer is called the fine-grained knowledge representation.
Attentive Graph Pooling Layer As shown in Figure 2, the query is the hidden state of the news article encoded by the PLM, while both the key and value share the same representation, derived from the hidden states of the entities in E ∪ N′, where E and N′ denote the entity set and the remaining neighbor set respectively. Each entity is assigned a corresponding weight by calculating the similarity between the news and this entity. The attention is computed as follows:

s = softmax((hW_Q)(CW_K)^T / √d_k) CW_V,

where h denotes the hidden state of a news article, C stacks the hidden states c of the entities in E ∪ N′, W_Q, W_K and W_V are parameter matrices, and d_k is the dimension of the keys.
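The pooling computation above can be sketched as a single-head scaled dot-product attention in NumPy (the paper uses multi-head attention; the single-head form below is a simplification of ours):

```python
import numpy as np

def attentive_pool(h, C, Wq, Wk, Wv):
    """h: (d,) news hidden state (query); C: (m, d) hidden states of the
    entities in E ∪ N' (keys and values). Returns the fine-grained
    knowledge vector s as an attention-weighted sum over entities."""
    q, K, V = h @ Wq, C @ Wk, C @ Wv
    scores = K @ q / np.sqrt(q.shape[0])   # news-entity similarity
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax over entities
    return w @ V
```

Entities whose keys align with the news query receive larger weights, so s emphasizes the entities most relevant to verifying the article.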

Deep Neural Network Classifier
We concatenate the representation of the news content h, the coarse-grained knowledge representation a, and the fine-grained knowledge representation s to obtain the final representation z. Then, z is fed into a fully connected layer followed by a softmax function to predict the probability distribution P over the news article's label.
The model is trained to minimize the following cross-entropy loss function:

L = − Σ_{i ∈ T} Y_i log(P_i) + λ‖Θ‖²,

where T refers to the training dataset, P_i and Y_i denote the distributions of the prediction and the true label of sample i, λ denotes the coefficient of L2 regularization, and Θ denotes the model parameters.
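The classification head and training objective can be sketched together in NumPy. The shapes and the two-class setup are illustrative assumptions consistent with the binary detection task:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(h, a, s, W, b):
    """Concatenate the news, coarse-grained, and fine-grained
    representations and apply a linear layer plus softmax to get
    the label distribution P."""
    z = np.concatenate([h, a, s])
    return softmax(W @ z + b)

def loss(P_batch, Y_batch, params, lam):
    """Cross-entropy over the training set plus L2 regularization
    on the model parameters."""
    ce = -sum(np.sum(Y * np.log(P)) for P, Y in zip(P_batch, Y_batch))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2
```

With zero weights the head outputs a uniform distribution over the two labels, and the loss for a one-hot target reduces to log 2, a quick check that the objective is wired correctly.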

Experiments

Datasets
To evaluate the proposed model, we conduct experiments on two datasets, PolitiFact and Gossipcop, which are included in the benchmark FakeNewsNet (Shu et al., 2020). The details of these news datasets are shown in Table 1. We study our proposed model in both few-shot and full-scale settings.
Few-shot settings To replicate low-resource situations in real-world scenarios, we randomly select k ∈ {2, 4, 8, 16, 100} news articles as the training set and create a validation set of the same size. The remaining news articles are used as the test set. Following Jiang et al. (2022), we sample the few-shot data with 10 random seeds and report the average value computed after deleting the maximum and minimum scores.
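The aggregation rule of the few-shot protocol, averaging the per-seed scores after dropping the single maximum and minimum, can be written as a small helper:

```python
def trimmed_mean(scores):
    """Average the per-seed scores after deleting the single maximum
    and minimum values, as in the few-shot evaluation protocol."""
    s = sorted(scores)
    trimmed = s[1:-1]  # drop one min and one max
    return sum(trimmed) / len(trimmed)
```

This trimming makes the reported score robust to one unlucky and one lucky seed among the 10 random samples.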
Full-scale settings For each dataset, we reserve 10% of the news articles as the validation set, and 5-fold cross-validation is conducted on the remaining news articles. Finally, the average score is reported.

Implementation Details
TagMe (Ferragina and Scaiella, 2010) is adopted to extract knowledge mentions from news articles. We utilize the BERT-base version with adapter as the pre-trained language model (Pfeiffer et al., 2020). The F1 score is used as the evaluation metric to measure the classification performance.

Baselines
We compare our proposed model with the following baselines: (1) DTC (Castillo et al., 2011): DTC is the decision tree model, which detects the authenticity of news by utilizing hand-crafted features.
(2) RFC (Kwon et al., 2013): RFC is the random forest classifier based on hand-crafted features to detect whether news is true or fake.
(3) SVM (Yang et al., 2012): SVM denotes a classification model that uses a hyperplane to separate news into true and fake news in high dimensional feature space.
(4) TextCNN (Kim, 2014): TextCNN is a popular deep learning model for text classification, which applies convolutional filters with various window sizes to extract text features. These features are fed into a pooling layer to further capture the most salient features to judge whether news is true or fake.
(5) BiLSTM (Bahad et al., 2019; Hochreiter and Schmidhuber, 1997): BiLSTM denotes the bidirectional long short-term memory network, which introduces a two-directional recurrent neural network architecture to better capture the temporal dependencies in text data. We utilize it to judge a news article's authenticity.
(6) KCNN (Wang et al., 2018): KCNN is a CNN-based model which concatenates news embeddings and knowledge entity embeddings to learn the representation of news.
(8) FB (Peters et al., 2018): FB denotes the feature-based method of utilizing the pre-trained language model for feature extraction. The BERT-base version is used in the experiments.
(10) KAN (Dun et al., 2021): KAN is a knowledge-aware attention network which incorporates external knowledge entities through attention mechanisms to predict the veracity of news articles.
(11) KPL (Jiang et al., 2022): KPL is the state-of-the-art model which designs a prompt template and incorporates external knowledge entities into the prompt representation.

Experimental Results
Experiments are conducted in both few-shot and full-scale settings on the two datasets. Baseline models are divided into five categories: traditional statistical methods (i.e., DTC, RFC, SVM), neural network methods without external knowledge (i.e., TextCNN, BiLSTM), neural network methods with external knowledge (i.e., KCNN, KLSTM, KAN), pre-trained language model methods without external knowledge (i.e., FB, FFT), and pre-trained language model methods with external knowledge (i.e., KPL).
The experimental results are presented in Table 2. We draw some conclusions in the few-shot setting. First, although performance is not stable as the amount of training data varies, we observe that increasing the training data usually yields an improvement. Second, the deep learning methods without external knowledge are usually worse than the statistical methods, possibly because deep models require more training data than is available in this setting. Moreover, our model is competitive with KPL in the few-shot setting on the PolitiFact dataset.
In the full-scale setting, our method outperforms all baselines on both datasets and achieves the highest F1 score. Specifically, KPL is the recent state-of-the-art model that incorporates large-scale pre-trained models and external knowledge, and our model surpasses KPL by +1.82 and +2.45 on the PolitiFact and Gossipcop datasets respectively. Thus, our model can effectively exploit both external knowledge and the pre-trained language model. In addition, neural network methods with external knowledge are superior to those without external knowledge. This suggests that knowledge can also improve the effectiveness of fake news detection when the training data is sufficient.

Analysis
We conduct a detailed analysis from different perspectives to demonstrate the effectiveness of the modules in our model for fake news detection.

Ablation study
We conduct an ablation study to validate the effectiveness of the coarse- and fine-grained knowledge in our approach. Table 3 presents the experimental results on two datasets. The coarse-grained knowledge is represented by the interaction node in the entity graph, while the fine-grained knowledge is provided by the attentive graph pooling layer. If the attentive graph pooling layer is removed from our model, F1 drops by 1.34 on the PolitiFact dataset and by 0.69 on the Gossipcop dataset. When the interaction node is removed, performance declines by 1.6 F1 on PolitiFact and by 0.66 F1 on Gossipcop. Therefore, both the coarse-grained and fine-grained knowledge are important to our model, and each independently improves the effectiveness of fake news detection.
To validate the contribution of adapter tuning, we conduct an ablation experiment on two datasets. The results listed in Table 4 demonstrate the superiority of adapter tuning. When adapter tuning is replaced with Full Fine-Tuning (FFT) or the Feature-Based (FB) approach, i.e., the "FFT+Graph" and "FB+Graph" variants, the performance is worse than that of our model on both datasets. Thus, our model benefits from adapter tuning.
We also conduct an ablation study to validate the necessity of pruning the entity graph before feeding it into the attentive graph pooling layer. In this variant, we feed the pruned entity graph into both the GAT layer and the attentive graph pooling layer. As Table 5 demonstrates, using both the coarse- and fine-grained knowledge is better than using only the fine-grained knowledge on both datasets. Thus, coarse-grained knowledge representations are beneficial to fake news detection.

Conclusion
In this paper, we propose the Knowledge grAPh enhAnced Language Model (KAPALM) for fake news detection. The proposed model integrates coarse- and fine-grained representations of entity knowledge. An entity graph is used to enrich the knowledge representation of the news article, and the pruned entity graph is fed into the attentive graph pooling layer to represent the fine-grained knowledge. The coarse- and fine-grained knowledge representations extracted from a large-scale knowledge graph are combined to improve fake news detection. Experimental results on two benchmark datasets show that the proposed KAPALM outperforms the state-of-the-art baseline models. In addition, KAPALM obtains competitive performance in the few-shot setting. In future work, we will investigate other effective approaches to mine accurate knowledge from knowledge graphs for fake news detection.

Limitations
One limitation of our model is that the constructed entity graph does not consider the multiple types of relationships or the extra attribute information of entities and relationships. If the veracity of a news article depends on the attributes of the entities or the relationships among entities in the article, this may lead to poor performance. In addition, although graph neural networks and the attentive graph pooling layer improve performance in fake news detection, our model still lacks interpretability.

Figure 1 :
Figure 1: A news article on PolitiFact and its entity graph.

Figure 2 :
Figure 2: The overview of our proposed model KAPALM

Figure 3 :
Figure 3: The process of knowledge extraction.

Table 2 :
Comparison with existing models. These scores refer to the F1 scores (%) of fake news.

Table 4 :
Comparison with different fine-tuning methods.

Table 5 :
Ablation experimental results. "Fine-grained" denotes that the pruned entity graph is fed into both the GAT layer and the attentive graph pooling layer. "Coarse- and Fine-grained" means that the original entity graph is fed into the GAT layer while the pruned entity graph is fed into the attentive graph pooling layer.