Hierarchical Multi-head Attentive Network for Evidence-aware Fake News Detection

The widespread dissemination of fake news and misinformation in domains ranging from politics and economics to public health has created an urgent need to automatically fact-check information. A recent trend in fake news detection is to utilize evidence from external sources. However, existing evidence-aware fake news detection methods focus on either word-level attention or evidence-level attention alone, which may result in suboptimal performance. In this paper, we propose a Hierarchical Multi-head Attentive Network to fact-check textual claims. Our model jointly combines multi-head word-level attention and multi-head document-level attention, which aids explanation at both the word level and the evidence level. Experiments on two real-world datasets show that our model outperforms seven state-of-the-art baselines. Improvements over baselines range from 6% to 18%. Our source code and datasets are released at https://github.com/nguyenvo09/EACL2021.


Introduction
The proliferation of biased news, misleading claims, disinformation and fake news has had heightened negative effects on modern society in domains ranging from politics and economics to public health. A recent study showed that maliciously fabricated and partisan stories possibly caused citizens' misperception of political candidates (Allcott and Gentzkow, 2017) during the 2016 U.S. presidential elections. In economics, the spread of fake news has manipulated stock prices (Kogan et al., 2019). For example, $139 billion was wiped out when the Associated Press (AP)'s hacked Twitter account posted a rumor about an explosion at the White House in which Barack Obama was injured. Recently, misinformation has caused infodemics in public health (Ashoka, 2020) and even led to fatalities in the physical world (Alluri, 2019).
To reduce the spread of misinformation and its detrimental influence, many fact-checking systems have been developed to fact-check textual claims. It is estimated that the number of fact-checking outlets across 60 countries has increased by 400% since 2014 (Stencel, 2019). Several fact-checking systems such as snopes.com and politifact.com are widely used by both online users and major corporations. Facebook (CNN, 2020) recently attached third-party fact-checks to social media posts, and Google integrated fact-checking articles into its search engine (Wang et al., 2018). These fact-checking systems debunk claims by manually assessing their credibility based on collected webpages used as evidence. However, this manual process is laborious and cannot scale to the large volume of false claims produced on communication platforms. Therefore, in this paper, our goal is to build an automatic fake news detection system that fact-checks textual claims based on collected evidence, to speed up the fact-checking process of such sites.
To detect fake news, researchers proposed to use linguistics and textual content (Castillo et al., 2011; Zhao et al., 2015; Liu et al., 2015). Since textual claims are usually deliberately written to deceive readers, it is hard to detect fake news by relying solely on claims' content. Therefore, multiple works utilized other signals such as temporal spreading patterns, network structures (Vo and Lee, 2018; Shu et al., 2020) and users' feedback (Shu et al., 2019; Vo and Lee, 2020a). However, limited work used external webpages as evidence documents, which could provide interpretable explanations to users. Several recent works (Popat et al., 2018; Ma et al., 2019; Vo and Lee, 2020b) started to utilize documents to fact-check textual claims. Popat et al. (2018) used word-level attention in documents but treated all documents with equal importance, whereas Ma et al. (2019) only focused on which documents are more crucial, without considering which words help explain the credibility of textual claims.
Observing the drawbacks of existing work, we propose a Hierarchical Multi-head Attentive Network which jointly utilizes word attention and evidence attention. The overall semantics of a document may be generated by multiple parts of the document. Therefore, we propose a multi-head word attention mechanism to capture the different semantic contributions of words to the meaning of the documents. Since a document may have different semantic aspects corresponding to various information related to the credibility of a claim, we propose a multi-head document-level attention mechanism to capture the contributions of the different semantic aspects of the documents. In our attention mechanism, we also use speaker and publisher information to further improve the effectiveness of our model. To our knowledge, our work is the first to apply a multi-head attention mechanism to both words and documents in evidence-aware fake news detection. Our work makes the following contributions:
• We propose a novel hierarchical multi-head attention network which jointly combines word attention and evidence attention for evidence-aware fake news detection.
• We propose a novel multi-head attention mechanism to capture important words and evidence.
• Experiments on two public datasets demonstrate the effectiveness and generality of our model over state-of-the-art fake news detection techniques.

Related Work
Many methods have been proposed to detect fake news in recent years. These methods can be placed into three groups: (1) human-based fact-checking sites (e.g. Snopes.com, Politifact.com), (2) machine-learning-based methods and (3) hybrid systems (e.g. content moderation on social media sites). In machine-learning-based methods, researchers have mainly used linguistics and textual content (Zellers et al., 2019; Zhao et al., 2015; Wang, 2017; Shu et al., 2019), temporal spreading patterns, network structures (Vo and Lee, 2018; You et al., 2019), users' feedback (Shu et al., 2019) and multimodal signals (Gupta et al., 2013; Vo and Lee, 2020b). Recently, researchers have focused on fact-checking claims based on evidence from different sources. Thorne and Vlachos (2017) and Vlachos and Riedel (2015) fact-check claims using subject-predicate-object triples extracted from a knowledge graph as evidence. Chen et al. (2020) assess claims' credibility using tabular data. Our work is closely related to the fact verification task (Thorne et al., 2018; Nie et al., 2019; Soleimani et al., 2020), which aims to classify a pair of a claim and a piece of evidence extracted from Wikipedia into three classes: supported, refuted, or not enough info. For this task, Nie et al. (2019) used ELMo (Peters et al., 2018) to extract contextual word embeddings together with a modified ESIM model (Chen et al., 2017). Soleimani et al. (2020) used BERT (Devlin et al., 2018) to retrieve evidence and verify claims. Zhou et al. (2019) used graph-based models for semantic reasoning. Our work differs from these works since our goal is to classify a pair of a claim and a list of relevant evidence documents into true or false.
Our work is close to existing work on evidence-aware fake news detection (Popat et al., 2018; Ma et al., 2019; Wu et al., 2020; Mishra and Setty, 2019). Popat et al. (2018) used an average pooling layer to derive claims' representations to attend to words in evidence, Mishra and Setty (2019) focused on words and sentences in each evidence document, and Ma et al. (2019) proposed a semantic entailment model to attend to important evidence. However, to the best of our knowledge, our work is the first to jointly use multi-head attention mechanisms to focus on important words in each evidence document and on important evidence among a set of relevant articles. Our attention mechanism differs from these works since we use multiple attention heads to capture the different semantic contributions of words and evidence.

Problem Statement
We denote an evidence-based fact-checking dataset C as a collection of tuples (c, s, D, P) where c is a textual claim originating from a speaker s, D = {d_i}_{i=1}^{k} is a collection of k documents relevant to the claim c, and P = {p_i}_{i=1}^{k} is the corresponding set of publishers of the documents in D. Note that |D| = |P|. Our goal is to classify each tuple (c, s, D, P) into a pre-defined class (i.e. true news or fake news).

Framework
In this section, we describe our Hierarchical Multi-head Attentive Network for Fact-Checking (MAC), which jointly considers word-level attention and document-level attention. Our framework consists of four main components: (1) an embedding layer, (2) a multi-head word attention layer, (3) a multi-head document attention layer and (4) an output layer. These components are illustrated in Fig. 1, where we show a claim and two documents as an example.

Embedding Layer
Each claim c is modeled as a sequence of n words [w^c_1, w^c_2, ..., w^c_n] and each document d_i is viewed as a sequence of m words [w^d_1, w^d_2, ..., w^d_m]. Each word w^c_i and w^d_j is projected into a D-dimensional vector e^c_i and e^d_j respectively by an embedding matrix W_e ∈ R^{V×D}, where V is the vocabulary size. Each speaker s and publisher p_i, modeled as one-hot vectors, is transformed into a dense vector s ∈ R^{D_1} and p_i ∈ R^{D_2} respectively by two matrices W_s ∈ R^{S×D_1} and W_p ∈ R^{P×D_2}, where S and P are the numbers of speakers and publishers in the training set. Both W_s and W_p are uniformly initialized in [−0.2, 0.2] and are jointly learned with the other parameters of our MAC.
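As a concrete illustration, the embedding lookups above can be sketched in NumPy. This is a minimal sketch, not the released implementation: the dimensions are toy values, and the randomly initialized W_e stands in for pre-trained word vectors (only W_s is initialized uniformly in [−0.2, 0.2] as described above).

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 1000, 300   # vocabulary size, word-embedding dimension
S, D1 = 50, 128    # number of speakers, speaker-embedding dimension

# Word-embedding matrix W_e (random stand-in for pre-trained vectors)
# and speaker-embedding matrix W_s, uniform in [-0.2, 0.2] as in the paper.
W_e = rng.normal(size=(V, D))
W_s = rng.uniform(-0.2, 0.2, size=(S, D1))

def embed_words(token_ids, W_e):
    """Project a sequence of token ids into D-dimensional vectors."""
    return W_e[token_ids]          # shape: (seq_len, D)

def embed_speaker(speaker_id, W_s):
    """One-hot speaker id -> dense speaker vector s in R^{D1}."""
    return W_s[speaker_id]         # shape: (D1,)

claim_ids = np.array([5, 17, 42])  # a toy 3-word claim
e_c = embed_words(claim_ids, W_e)
s = embed_speaker(3, W_s)
print(e_c.shape, s.shape)          # (3, 300) (128,)
```

In training, both embedding matrices would be updated by backpropagation along with the rest of the model.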

Multi-head Word Attention Layer
We input the word embeddings e^c_i of the claim c into a bidirectional LSTM (Graves et al., 2005), which generates a contextual representation h_i of each token:

h_i = [h→_i ; h←_i] ∈ R^{2H}

where h→_i and h←_i are the hidden states of the forward and backward passes of the BiLSTM, the symbol ; denotes concatenation and H is the hidden size. We derive the claim's representation c ∈ R^{2H} with an average pooling layer:

c = (1/n) Σ_{i=1}^{n} h_i    (Eq. 1)

Applying a similar process on top of each document d_i with a different BiLSTM, we obtain a contextual representation h^d_j ∈ R^{2H} for each word in d_i; after the BiLSTM, d_i is modeled as a matrix H^d ∈ R^{m×2H}. To understand what information in a document helps us fact-check a claim, we need to guide our model to focus on crucial keywords or phrases of the document. Drawing inspiration from (Luong et al., 2015), we first replicate the vector c (Eq. 1) m times to create a matrix C_1 ∈ R^{m×2H} and propose an attention mechanism to attend to important words in the document d_i:

a_1 = softmax(tanh([H^d; C_1] W_1) w_2)    (Eq. 2)

where W_1 ∈ R^{4H×a_1} and w_2 ∈ R^{a_1} are learnable parameters, [H^d; C_1] is the concatenation of the two matrices on the last dimension and a_1 ∈ R^m is the attention distribution over the m words. However, the overall semantics of the document might be generated by multiple parts of the document (Lin et al., 2017). Therefore, we propose a multi-head word attention mechanism to capture the different semantic contributions of words by extending the vector w_2 into a matrix W_2 ∈ R^{a_1×h_1}, where h_1 is the number of attention heads shown in Fig. 1. We modify Eq. 2 as follows:

A_1 = softmax(tanh([H^d; C_1] W_1) W_2)    (Eq. 3)

where A_1 ∈ R^{m×h_1} and each column of A_1 is normalized by the softmax operation. Intuitively, A_1 represents h_1 different attention distributions over the m words of the document d_i, helping us capture different aspects of the document. After computing A_1, we derive the representation of document d_i as follows:

d_i = flatten(A_1^T · H^d)    (Eq. 4)

where d_i ∈ R^{h_1·2H} and the function flatten(.) flattens A_1^T · H^d into a vector. We also implemented the more sophisticated multi-head attention of (Vaswani et al., 2017) but did not achieve good results.
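The multi-head word attention described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the tanh scoring function follows the self-attention formulation of Lin et al. (2017) that the paper cites, and the random matrices stand in for BiLSTM states and learned parameters.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_word_attention(H_d, c, W1, W2):
    """
    H_d : (m, 2H)  contextual word states of one document
    c   : (2H,)    average-pooled claim representation (Eq. 1)
    W1  : (4H, a1), W2 : (a1, h1)  attention parameters
    Returns the attended document vector of size h1 * 2H (Eq. 4).
    """
    m = H_d.shape[0]
    C1 = np.tile(c, (m, 1))                  # replicate claim vector m times
    # Multi-head attention scores (Eq. 3); each column of A1 sums to 1.
    scores = np.tanh(np.concatenate([H_d, C1], axis=1) @ W1) @ W2  # (m, h1)
    A1 = softmax(scores, axis=0)
    return (A1.T @ H_d).reshape(-1)          # flatten(A1^T H^d)

rng = np.random.default_rng(1)
m, H, a1, h1 = 100, 128, 256, 5              # toy dimensions (assumptions)
H_d = rng.normal(size=(m, 2 * H))
c = rng.normal(size=(2 * H,))
W1 = rng.normal(size=(4 * H, a1)) * 0.01
W2 = rng.normal(size=(a1, h1)) * 0.01
d_i = multi_head_word_attention(H_d, c, W1, W2)
print(d_i.shape)  # (1280,) = h1 * 2H
```

With h_1 heads, each head yields its own attention distribution over the m words, and their attended summaries are concatenated, matching d_i ∈ R^{h_1·2H}.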

Multi-head Document Attention Layer
This layer consists of three components: (1) extending the representations of claims, (2) extending the representations of evidence and (3) a multi-head document attention mechanism.
Extending representations of claims. So far the representation of the claim c (Eq. 1) comes only from textual content. In reality, the speaker who made a claim may impact the credibility of the claim. For example, claims from some politicians are controversial and inaccurate (Allcott and Gentzkow, 2017). Therefore, we enrich the vector c by concatenating it with the speaker's embedding s:

c_ext = [c; s] ∈ R^x    (Eq. 5)

where x = 2H + D_1.
Extending representations of evidence. Intuitively, an article published by nytimes.com might be more reliable than a piece of news published by breitbart.com, which is known to be a less credible site. Therefore, to capture more information, we further enrich the representations of evidence with publisher information by concatenating d_i (Eq. 4) with its publisher's embedding p_i:

d'_i = [d_i; p_i] ∈ R^y    (Eq. 6)

where y = 2h_1H + D_2. From Eq. 6, we generate representations of the k relevant articles and stack them into a matrix:

D = [d'_1; d'_2; ...; d'_k] ∈ R^{k×y}    (Eq. 7)
Multi-head Document Attention Mechanism.
In real life, a journalist from snopes.com or politifact.com may use all k articles relevant to the claim c to fact-check it, but she may focus on a few key articles to determine the verdict of the claim while other articles carry negligible information. To capture this intuition, we need to downgrade uninformative documents and concentrate on more meaningful articles. Similar to Section 4.2, we use a multi-head attention mechanism which produces different attention distributions representing the diverse contributions of articles toward determining the veracity of the claim c.
We first create a matrix C_2 ∈ R^{k×x} by replicating the vector c_ext (Eq. 5) k times. Second, the matrix C_2 is concatenated with the matrix D (Eq. 7) on the last dimension, denoted as [D; C_2] ∈ R^{k×(x+y)}.
Our proposed multi-head document-level attention mechanism applies h_2 different attention heads:

A_2 = softmax(tanh([D; C_2] W_3) W_4)    (Eq. 8)

where W_3 ∈ R^{(x+y)×a_2} and W_4 ∈ R^{a_2×h_2}. The matrix A_2 ∈ R^{k×h_2}, each of whose columns is normalized by the softmax operator, is a collection of h_2 different attention distributions over the k documents. Using these attention weights, we generate the attended representation of the k evidence documents, d_rich ∈ R^{h_2·y}:

d_rich = flatten(A_2^T · D)    (Eq. 9)

where the flatten(.) function flattens A_2^T · D into a vector. We finally generate the representation of a tuple (c, s, D, P) by concatenating the vector c_ext (Eq. 5) and the vector d_rich (Eq. 9), denoted as [c_ext; d_rich].
To the best of our knowledge, our work is the first to utilize a multi-head attention mechanism integrated with speaker and publisher information to capture the various semantic contributions of evidence toward the fact-checking process.
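The document-level attention mirrors the word-level one, operating over the k stacked evidence vectors. The following NumPy sketch illustrates Eq. 8 and Eq. 9; the tanh scoring and the toy dimensions are assumptions, and the random matrices stand in for the stacked evidence representations and learned parameters.

```python
import numpy as np

def softmax_cols(x):
    """Column-wise softmax: each column sums to 1."""
    x = x - x.max(axis=0, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=0, keepdims=True)

def multi_head_doc_attention(D_mat, c_ext, W3, W4):
    """
    D_mat : (k, y)  stacked publisher-enriched evidence vectors (Eq. 7)
    c_ext : (x,)    claim representation extended with the speaker embedding
    W3    : (x + y, a2), W4 : (a2, h2)  attention parameters
    Returns d_rich of size h2 * y (Eq. 9).
    """
    k = D_mat.shape[0]
    C2 = np.tile(c_ext, (k, 1))                 # replicate c_ext k times
    # h2 attention distributions over the k documents (Eq. 8)
    scores = np.tanh(np.concatenate([D_mat, C2], axis=1) @ W3) @ W4  # (k, h2)
    A2 = softmax_cols(scores)
    return (A2.T @ D_mat).reshape(-1)           # flatten(A2^T D)

rng = np.random.default_rng(2)
k, x, y, a2, h2 = 30, 512, 1408, 256, 2         # toy dimensions (assumptions)
D_mat = rng.normal(size=(k, y))
c_ext = rng.normal(size=(x,))
W3 = rng.normal(size=(x + y, a2)) * 0.01
W4 = rng.normal(size=(a2, h2)) * 0.01
d_rich = multi_head_doc_attention(D_mat, c_ext, W3, W4)
print(d_rich.shape)  # (2816,) = h2 * y
```

Each of the h_2 columns of A_2 can emphasize a different subset of documents, which is what the case study later visualizes head by head.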

Output Layer
In this layer, we input the tuple representation [c_ext; d_rich] into a multilayer perceptron (MLP) to compute the probability ŷ that the claim c is true news:

ŷ = σ(W_6 · g(W_5 · [c_ext; d_rich] + b_5) + b_6)

where W_5, W_6, b_5, b_6 are the weights and biases of the MLP, g(.) is the hidden-layer activation and σ(.) is the sigmoid function. We optimize our model by minimizing the standard cross-entropy loss shown at the top of Fig. 1:

L = −(y · log ŷ + (1 − y) · log(1 − ŷ))

where y ∈ {0, 1} is the ground-truth label of a tuple (c, s, D, P). During training, we sample mini-batches of 32 tuples and compute the average loss over each batch.
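The MLP head and the cross-entropy loss can be sketched as follows. This is a minimal sketch: the ReLU hidden activation is an assumption, since the paper does not name the activation, and the toy dimensions do not match the real model.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer(v, W5, b5, W6, b6):
    """Two-layer MLP head producing probability y_hat = sigma(...).
    ReLU as the hidden activation is an assumption, not stated in the paper."""
    h = np.maximum(0.0, v @ W5 + b5)      # hidden layer
    return sigmoid(float(h @ W6 + b6))    # scalar probability in (0, 1)

def bce_loss(y_hat, y):
    """Standard binary cross-entropy for ground-truth label y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

dim, hidden = 64, 32                      # toy dimensions (assumptions)
v = rng.normal(size=(dim,))               # stand-in for [c_ext; d_rich]
W5 = rng.normal(size=(dim, hidden)) * 0.1
b5 = np.zeros(hidden)
W6 = rng.normal(size=(hidden,)) * 0.1
b6 = 0.0
y_hat = output_layer(v, W5, b5, W6, b6)
print(0.0 < y_hat < 1.0)  # True
```

In training, the batch loss would be the mean of bce_loss over the 32 tuples in a mini-batch.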

Datasets
We employed two public datasets released by Popat et al. (2018). Each dataset is a collection of tuples (c, s, D, P, y) where each textual claim c and its credibility label y are collected from one of two major fact-checking websites, snopes.com and politifact.com. The articles pertinent to the claim c were retrieved using search engines. Each Snopes claim was labeled as true or false, while PolitiFact originally had six labels: true, mostly true, half true, false, mostly false, and pants on fire. Following Popat et al. (2018), we merge true, mostly true and half true into true claims, and the rest into false claims. Details of our datasets are presented in Table 1. Note that Snopes does not have speaker information.
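The PolitiFact label merging can be expressed as a small helper. This is a sketch of the mapping described above; the exact label strings in the released data may be formatted differently.

```python
# Politifact's six-way labels merged to binary, following Popat et al. (2018):
# {true, mostly true, half true} -> true; the remaining three -> false.
TRUE_LABELS = {"true", "mostly true", "half true"}

def merge_label(label: str) -> str:
    """Map a six-way Politifact label to a binary true/false label."""
    return "true" if label.lower() in TRUE_LABELS else "false"

print(merge_label("Mostly True"))    # true
print(merge_label("pants on fire"))  # false
```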

Baselines
We compare our MAC model with seven state-of-the-art baselines divided into two groups. The first group uses only the textual content of claims, and the second group utilizes relevant articles to fact-check textual claims. A related method (Mishra and Setty, 2019) used subject information of articles (e.g. politics, entertainment), which was not available in our datasets. We tried to compare with it but achieved poor results, perhaps due to the missing information. Therefore, we do not report its results in this paper. Details of the baselines are as follows:
Using only claims' text:
• BERT (Devlin et al., 2018) is a pre-trained language model achieving state-of-the-art results on many NLP tasks. The representation of the [CLS] token is input to a trainable linear layer to classify claims.
• LSTM-Last is a model proposed in (Rashkin et al., 2017). LSTM-Last takes the last hidden state of the LSTM as the representation of a claim, which is input to a linear layer for classification.
Note that we also applied BERT, LSTM-Last, LSTM-Avg and CNN using both claims' text and articles' text. For each of these baselines, we concatenated a claim's text and a document's text, and input the concatenated content into the baseline to compute the likelihood that the claim is fake news. We computed the average probability over all documents of a claim and used it as the final prediction. However, we did not observe considerable improvements for these baselines. In addition to deep-learning-based baselines, we compared our MAC with feature-based techniques (e.g. SVM). As expected, these traditional techniques had inferior performance compared with neural models. Therefore, we only report the seven baselines' performance.

Experimental Settings
For each dataset, we randomly select 10% of the claims from each class to form a validation set, which is used for tuning hyper-parameters. We report 5-fold stratified cross-validation results on the remaining 90% of the data: we train our model and the baselines on 4 folds and test them on the remaining fold. We use AUC, macro/micro F1, class-specific F1, Precision and Recall as evaluation metrics. To mitigate overfitting and reduce training time, we early-stop the training process when macro F1 on the validation data has continuously decreased for 10 epochs. When we get the same macro F1 between consecutive epochs, we rely on AUC for early stopping.
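The early-stopping criterion above can be sketched as a small function. This is our reading of the criterion (stop after a 10-epoch window with no macro-F1 improvement, breaking F1 ties with AUC), not the authors' actual code.

```python
def should_stop(f1_hist, auc_hist, patience=10):
    """Stop when validation macro F1 has shown no improvement for `patience`
    consecutive epochs; ties in F1 are broken by AUC, as described above.
    f1_hist and auc_hist are per-epoch validation scores."""
    if len(f1_hist) <= patience:
        return False
    f1s = f1_hist[-(patience + 1):]
    aucs = auc_hist[-(patience + 1):]
    for i in range(1, len(f1s)):
        improved_f1 = f1s[i] > f1s[i - 1]
        tie_broken_by_auc = f1s[i] == f1s[i - 1] and aucs[i] > aucs[i - 1]
        if improved_f1 or tie_broken_by_auc:
            return False  # some improvement inside the window: keep training
    return True

# Monotonically decreasing validation F1 for 12 epochs -> stop.
print(should_stop([0.7 - 0.01 * i for i in range(12)], [0.5] * 12))  # True
```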
For fair comparison, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 and regularize the parameters of all methods with an ℓ2 norm and weight decay λ = 0.001. As the maximum lengths of claims and articles are 30 and 100 words respectively for both datasets, we set n = 30 and m = 100. For HAN and our model, we set k = 30 since the number of articles per claim is at most 30 in both datasets. The batch size is set to 32 and we trained all models until convergence. We tune all models, including ours, with the hidden size H chosen from {64, 128, 300}; pre-trained word embeddings are from GloVe (Pennington et al., 2014) with D = 300. Both D_1 and D_2 are tuned over {128, 256}. The numbers of attention heads h_1 and h_2 are chosen from {1, 2, 3, 4, 5}, and a_1 and a_2 are set to 2 × H. In addition to GloVe, we also utilized contextual embeddings from pre-trained language models such as ELMo and BERT but achieved comparable performance. We implemented all methods in PyTorch 0.4.1 and ran experiments on an NVIDIA GTX 1080.

Performance of MAC and baselines
We show experimental results of our model and baselines in Tables 2 and 3. In Table 2, MAC outperforms all baselines with significance level p < 0.05 by using one-sided paired Wilcoxon test on Snopes dataset. MAC achieves the best result when h 1 = 5, h 2 = 2, H = 300 and D 1 = D 2 = 128. In Table 3, MAC also significantly outperforms all baselines with p < 0.05 according to one-sided paired Wilcoxon test on PolitiFact dataset. The hyperparameters we selected for MAC are h 1 = 3, h 2 = 1, H = 300 and D 1 = D 2 = 128.
For the baselines, BERT is used as a static encoder. We tried to fine-tune it but achieved even worse results, perhaps because we do not have sufficient data to tune it. For both HAN and DeClare, since neither paper releases its source code, we tried our best to reproduce the results of the two models. HAN derives the representation of each document from the last hidden state of a GRU (Chung et al., 2014), without any attention mechanism on words to downgrade unimportant words (e.g. stop words), leading to poor document representations. Therefore, the document-level attention mechanism in HAN did not perform well. Similar patterns can be observed in the baselines LSTM-Avg and LSTM-Last. DeClare performed best among the baselines, indicating the importance of applying word-level attention to reduce the impact of less informative words.
We can see that our MAC outperforms all baselines on all metrics. When viewing true news as the positive class, MAC has an average improvement of 16.0% and 7.1% over the best baselines on Snopes and PolitiFact respectively. We also achieve an average improvement of 4.7% over the baselines, with a maximum improvement of 10.1% on PolitiFact, when considering fake news as the negative class. In terms of AUC, the average improvements of MAC over the baselines are 7.9% and 6.1% on Snopes and PolitiFact respectively. The improvements of MAC over the baselines can be explained by our multi-head attention mechanisms in Eq. 3 and Eq. 8: after attending to words in documents, we generate better representations of documents/evidence, leading to more effective document-level attention compared with the HAN model.

Ablation Studies
Impact of Word Attention and Evidence Attention. We study the impact of the attention layers on the performance of MAC by (1) using only word attention and replacing evidence attention with an average pooling layer on top of documents' representations, and (2) using only evidence attention and replacing word attention with an average pooling layer on top of words' representations. As shown in Table 4, using only word attention performs much better than using only evidence attention. This is because, without downgrading less informative words in evidence, irrelevant information can be captured, leading to low-quality representations of evidence. This experiment aligns with our observation that the HAN model, which used only evidence attention, did not perform well. When combining both attention mechanisms hierarchically, we consistently achieve the best results on the two datasets in Table 4. In particular, the model Word & Doc Att outperformed both Only Word Att and Only Evidence Att significantly with p-value < 0.05. This result indicates that it is crucial to combine word-level attention and document-level attention to improve performance on the evidence-aware fake news detection task.
Impact of Speakers and Publishers on MAC. To study how speakers and publishers impact the performance of MAC, we experiment with four models: (1) using text only (Text Only), (2) using text and publishers (Text + Publishers), (3) using text and speakers (Text + Speakers) and (4) using text, publishers and speakers (Text + Pubs + Spkrs). In Table 5, Text + Publishers performs better than using only text on both datasets. On PolitiFact, Text + Speakers achieves 2∼3% improvements over Text + Publishers, indicating that the speakers who made the claims are crucial to determining the claims' verdicts. Finally, using all information (Text + Pubs + Spkrs) achieves the best result on PolitiFact.
On Snopes, we omit the results of Text + Speakers and Text + Pubs + Spkrs because the dataset does not contain speaker information. In particular, the model Text + Pubs + Spkrs outperformed Text Only and Text + Publishers significantly (p-value < 0.05). Based on these results, we conclude that integrating speaker and publisher information is useful for detecting misinformation.

Impact of the Number of Attention Heads
In this section, we examine the sensitivity of MAC with respect to the number of heads h_1 in the word attention layer and the number of heads h_2 in the document attention layer. We vary h_1 and h_2 over {1, 2, 3, 4, 5}.
Since AUC is less sensitive to the choice of classification threshold, we report the AUC of MAC on the two datasets in Fig. 2(a) and 2(b). A common pattern in the two figures is that the performance of MAC tends to improve when we increase the number of heads h_1 in the word attention layer, while it tends to decrease when increasing h_2. This phenomenon indicates that word attention is more important than evidence attention. On Snopes, MAC has the best AUC when h_1 = 5, h_2 = 2. On PolitiFact, MAC reaches its peak when h_1 = 3, h_2 = 1.

Case Study
To understand how the multi-head attention mechanism works, we visualize attention weights on three documents of a false claim from the testing set: "Actor Christopher Walken planning making bid US presidency 2008". Note that our MAC correctly classifies this claim as fake news. In Fig. 3 and Fig. 4, we show the claim and visualizations of two different heads in the word attention layer. Note that Popat et al. (2018), who released the datasets, already lowercased the text and removed punctuation; to conduct a fair comparison, we used the datasets directly without additional preprocessing. In Fig. 3, the attention weights are sparse, indicating that the first attention head focuses on the most important words determining the credibility of the claim (e.g. hoax, false). In contrast, in Fig. 4, the second attention head has more diffuse attention weights to capture useful phrases from documents (e.g. walken not running, its obviously not). Moving on to the attention heads in the evidence attention layer, Fig. 5 shows a heat map where the x-axis is the five heads extracted from the evidence attention layer and the y-axis is the three documents relevant to the same claim as in Fig. 3 and 4. As we can see in Fig. 5, Head 1, Head 3 and Head 5 emphasize Doc 3, which contains refuting phrases (e.g. its obviously not), while Head 4 focuses on Doc 1, which has negating information such as walken not running. Both Doc 1 and Doc 3 contain crucial signals for fact-checking the claim. From these analyses, we conclude that heads in the word attention layer capture different semantic contributions of words, and different heads in the document attention layer capture important documents.

Conclusions
In this paper, we propose a novel evidence-aware model to fact-check textual claims. Our MAC is built by hierarchically stacking two attention layers: a word attention layer followed by a document attention layer. In both layers, we propose multi-head attention mechanisms to capture the different semantic contributions of words and documents. MAC outperforms the baselines significantly, with an average increase of 6% to 9% over the best baseline results and a maximum improvement of 18%. We conduct ablation studies to understand MAC's performance and provide a case study showing the effectiveness of the attention mechanisms. In future work, we will examine other data types, such as images, to further improve our model.