FakeFlow: Fake News Detection by Modeling the Flow of Affective Information

Fake news articles often stir readers' attention by means of emotional appeals that arouse their feelings. Unlike in short news texts, authors of longer articles can exploit such affective factors more freely, adding exaggerations or fabricated events in order to manipulate the readers' emotions. To capture this, in this paper we propose to model the flow of affective information in fake news articles using a neural architecture. The proposed model, FakeFlow, learns this flow by combining topic and affective information extracted from text. We evaluate the model's performance with several experiments on four real-world datasets. The results show that FakeFlow achieves superior results compared to state-of-the-art methods, confirming the importance of capturing the flow of affective information in news articles.


Introduction
In today's information landscape, fake news is used to manipulate public opinion (Zhou and Zafarani, 2018) by reshaping readers' views on particular issues. To achieve this goal, the authors of fake news narratives need to capture the reader's interest, so they put considerable effort into making their articles look objective and realistic. This is usually done by adding misleading terms or events that can have a negative or positive impact on the readers' emotions.
False information in short texts, e.g., fake claims or misleading headlines, might be less harmful than full news articles. Such texts may contain eye-catching terms that aim to manipulate the readers' emotions (Chakraborty et al., 2016), and in many cases identifying this kind of exaggeration in a short statement is enough to unmask the fabrication. In fake news articles, on the other hand, the authors exploit the length of the news to conceal their fabricated story, exposing readers to emotional manipulation throughout longer texts that contain several imprecise or fabricated plots. The flow of information has been investigated for different tasks: Reagan et al. (2016) studied the emotional arcs in stories in order to understand complex emotional trajectories; other works model the flow of emotions over a book and quantify its usefulness for predicting the book's success, or explore the problem of creating tags for movies from plot synopses using emotions.
Unlike previous works (Rashkin et al., 2017; Shu et al., 2018; Castelo et al., 2019; Ghanem et al., 2020) that discarded the chronological order of events in news articles, in this work we propose a model that takes into account the affective changes in texts to detect fake news. We hypothesize that fake news has a different distribution of affective information across the text compared to real news, e.g., more fear emotion in the first part of the article or more offensive terms overall. Therefore, modeling the flow of such information may help to discriminate fake from real news. Our model consists of two main sub-modules, for topic-based and affective information detection. We combine these two sub-modules since a news article's topic may be correlated with its affective information. For example, a fake news article about Islam or Black people is likely to provoke fear and express negative sentiment, while another fake news article in favor of a particular politician might try to evoke more positive emotions and also express some exaggerations. The contributions of our work are as follows:
• We design a model that detects fake news articles by taking into account the flow of affective information.1
• Extensive experiments on four standard datasets demonstrate the effectiveness of our model over state-of-the-art alternatives.
• We build a novel fake news dataset, called Multi-SourceFake, that is collected from a large set of websites and annotated on the basis of the joint agreement of a set of news sources.

Related Work
Previous work on fake news detection is mainly divided into two lines, focusing either on social media (Zubiaga et al., 2015; Aker et al., 2017; Ghanem et al., 2019) or on online news articles (Tausczik and Pennebaker, 2010; Horne and Adali, 2017; Rashkin et al., 2017; Barrón-Cedeno et al., 2019). In this work we focus on the latter. Fact-checking (Karadzhov et al., 2017; Zlatkova et al., 2019; Shu et al., 2019a) is another closely related research topic. However, fact-checking targets only short texts (that is, claims) and focuses on using external resources (e.g., Web, knowledge sources) to verify the factuality of the news. Previous work on fake news detection focuses mainly on proposing new feature sets. Horne and Adali (2017) present a set of content-based features, including readability (number of unique words, SMOG readability measure, etc.), stylistic (frequency of part-of-speech tags, number of stop words, etc.) and psycholinguistic features (i.e., several categories from the LIWC dictionary (Tausczik and Pennebaker, 2010)). When these features are fed into a Support Vector Machine (SVM) classifier and applied, for instance, to the task of distinguishing satire from real news, they obtain high accuracies. Using the same features for the task of fake news detection, however, results in somewhat lower scores. Ghanem et al. (2020) propose the Emotionally Infused Network (EIN). EIN encodes the text of the articles and their affective content, based on several dictionaries, and then combines the two vector representations. The authors evaluate their model on a multi-class false information dataset and show the effectiveness of using emotion features extracted from the text. Despite the large variety of features and models that have been explored in previous work, none of these works considers the sequence of affective information in text; instead, they feed the entire news article as one segment into their models.
In contrast, the aim of our work is to evaluate this source of information, using a neural architecture.

The FakeFlow Model
Given an input document, the FakeFlow model first divides it into N segments. Then it uses both word embeddings and affective features, such as emotions, hyperbolic words, etc., to capture the flow of emotions in the document. The model learns to pay attention to the flow of affective information throughout the document in order to detect whether it is fake or real. Figure 1 shows the architecture of the FakeFlow model. The neural architecture has two main modules: the first uses a Convolutional Neural Network (CNN) to extract topic-based information from articles (left branch); the second models the flow of the affective information within the articles via Bidirectional Gated Recurrent Units (Bi-GRUs) (right branch).

Topic-based Information
Given a segment n ∈ {1, ..., N} of words, the model first embeds the words into vectors through an embedding matrix. Then it uses a CNN that applies convolution operations and max pooling to obtain an abstract representation of the input segment. This representation highlights important words, in which the topic information of the segment is summarized. The model then applies a fully connected layer on the output segments to get a smaller representation (v_topic) for later concatenation with the representation of affective information:

v_topic = f(W_a · v_cnn + b_a)

where v_cnn is the pooled CNN output of the segment, W_a and b_a are the corresponding weight matrix and bias terms, and f is an activation function such as ReLU, tanh, etc.
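As an illustration, the topic branch can be sketched in plain NumPy. The window size, filter count, and the random weights used below are illustrative stand-ins, not the tuned hyper-parameters of FakeFlow:

```python
import numpy as np

def topic_branch(segment_ids, emb, conv_w, conv_b, dense_w, dense_b):
    """Sketch of the topic branch: embed -> 1D conv -> max-pool -> dense.

    segment_ids: (seg_len,) word indices for one segment.
    emb: (vocab, d) embedding matrix; conv_w: (k, d, filters).
    All weights here are random stand-ins, not trained parameters.
    """
    x = emb[segment_ids]                      # (seg_len, d)
    k, d, f = conv_w.shape
    # valid 1D convolution over time
    conv = np.stack([
        np.tensordot(x[i:i + k], conv_w, axes=([0, 1], [0, 1])) + conv_b
        for i in range(len(x) - k + 1)
    ])                                        # (seg_len-k+1, filters)
    conv = np.maximum(conv, 0)                # ReLU
    pooled = conv.max(axis=0)                 # global max-pool -> (filters,)
    # fully connected layer: v_topic = f(W_a @ pooled + b_a)
    return np.maximum(dense_w @ pooled + dense_b, 0)
```

The global max-pool is what lets the branch "highlight important words" regardless of where they occur in the segment.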
Key to FakeFlow is its ability to capture the relevance of the affective information with respect to the topics. For this, we concatenate the topic summarized vector v topic with the representation vector v affect , aimed at capturing the affective information extracted from each segment (Section 3.2).
To merge the different representations and capture their joint interaction in each segment, the model processes the concatenated vector v_concat with another fully connected layer:

v_fc = f(W_b · v_concat + b_b)

where W_b and b_b are the layer's weight matrix and bias terms. In order to create an attention-focused representation of the segments, highlighting important ones and giving the model the ability to weight segments differently according to the similarity of neighboring segments, the model applies a context-aware self-attention mechanism (Zheng et al., 2018) on v_fc. This is a crucial step, as the importance of a segment at timestep t is related to the other segments, since they share the same context in the news article. Moreover, applying the attention layer can help us understand which features are most relevant, by showing which words the network attends to during learning. The output of the attention layer is an attention matrix l_t with scores for each token at each timestep.
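A heavily simplified stand-in for this attention step can be sketched as follows. It scores each segment against the mean of all segments (as a proxy for the shared article context), rather than implementing the full context-aware self-attention of Zheng et al. (2018):

```python
import numpy as np

def segment_attention(v_fc):
    """Simplified context-aware attention over segment vectors.

    Each segment is scored against the mean of all segments (its shared
    'context'), then scores are softmax-normalized. This is a sketch, not
    the full mechanism of Zheng et al. (2018).
    v_fc: (N, d) merged segment representations.
    Returns one attention weight per segment, summing to 1.
    """
    context = v_fc.mean(axis=0)                      # shared article context
    scores = v_fc @ context / np.sqrt(v_fc.shape[1]) # scaled similarity
    e = np.exp(scores - scores.max())                # stable softmax
    return e / e.sum()                               # (N,) weights
```

The key property preserved here is that a segment's weight depends on all other segments, since they contribute to the shared context vector.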

Affective Flow of Information
To model the affective information flow in the news articles, we choose the following lexical features, under the assumption that they have a different distribution across the articles' segments. We use a term frequency representation weighted by the articles' length to extract the following features from each segment n: • Emotions: We use emotions as features to detect their changes across the articles' segments. For that, we use the NRC emotion lexicon (Mohammad and Turney, 2010), which contains ∼14K words labeled with Plutchik's eight emotions (8 Features).
• Sentiment: We extract the sentiment from the text, positive and negative, again using the NRC lexicon (Mohammad and Turney, 2010) (2 Features).
• Morality: We consider cue words from the Moral Foundations Dictionary 2 (Graham et al., 2009) where words are assigned to one (or more) of the following categories: care, harm, fairness, unfairness (cheating), loyalty, betrayal, authority, subversion, sanctity and degradation (10 Features).
• Imageability: We use a list of words rated by their degree of abstractness and imageability 3 . These words have been extracted from the MRC psycholinguistic database (Wilson, 1988) and then, using a supervised learning algorithm, annotated with degrees of abstractness and imageability. The list contains 4,295 and 1,156 words rated by their degree of abstractness and imageability, respectively (2 Features).
• Hyperbolic: We use a list of hyperbolic words, i.e., exaggerated terms used to attract the readers' attention (1 Feature).
To model the flow of the above features, we represent each segment of an article by a vector v affect capturing all 23 features listed above. Then we feed the document's vectors to a Bi-GRU network to summarize the contextual flow of the features from both directions 4 to obtain v flow .
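The per-segment feature extraction can be illustrated with toy lexicons; the real NRC, morality, and imageability resources are far larger, and only three of the 23 feature dimensions are shown:

```python
import numpy as np

# Toy stand-ins for the affective lexicons; the real NRC lexicon alone
# contains ~14K words. Entries here are purely illustrative.
LEXICONS = {
    "fear": {"terrifying", "kill", "threat"},
    "joy": {"happy", "celebrate"},
    "negative": {"kill", "threat", "bad"},
}

def affect_vector(segment_tokens, article_len):
    """v_affect for one segment: per-lexicon term frequency,
    weighted (normalized) by the article's length."""
    return np.array([
        sum(tok in words for tok in segment_tokens) / article_len
        for words in LEXICONS.values()
    ])
```

Computing one such vector per segment yields the sequence that is then fed to the Bi-GRU to summarize the flow from both directions.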
Given the segments' flow representation (v_flow) of an article and their relevance to the topics (l_t), FakeFlow applies a dot product operation and then averages the output matrix across the segments to get a compact representation v_compact, which is then fed into a fully connected layer:

o = f(W_c · v_compact + b_c)

where W_c and b_c are the layer's weight matrix and bias terms. Finally, to generate the overall factuality label of the article, a softmax layer is applied to the output o of the fully connected layer.
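The final combination step can be sketched as follows, with the attention scores reduced to one weight per segment and stand-in (untrained) weights for the dense layer:

```python
import numpy as np

def combine(v_flow, l_t, w_c, b_c):
    """Combine the Bi-GRU flow states with the attention scores:
    weight each segment's flow vector, average across segments, and
    pass through a final dense layer followed by softmax.

    v_flow: (N, d) Bi-GRU outputs; l_t: (N,) attention weights;
    w_c, b_c: stand-in dense-layer parameters.
    """
    weighted = v_flow * l_t[:, None]       # dot-product weighting per segment
    v_compact = weighted.mean(axis=0)      # average across segments
    logits = w_c @ v_compact + b_c
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax over {real, fake}
```

With zero weights the softmax is uniform; in the trained model the dense layer separates the two factuality classes.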

Fake News Datasets
Despite the recent efforts to debunk online fake news, there is a dearth of publicly available datasets. Most of the available datasets are small in size (e.g., the Politifact 5 dataset in (Shu et al., 2018) has ∼700 available articles, the Celebrity dataset in (Pérez-Rosas et al., 2018) has ∼500 articles, etc.), their test parts have not been manually annotated, or they have been collected from a very small number of news sources. Nonetheless, we evaluate FakeFlow on three of the available datasets to demonstrate its performance. In addition, we create our own dataset. Table 1 gives an overview of the datasets used in our work.

MultiSourceFake:
We rely on different resources for creating the training and test portions of the dataset, so as to provide a challenging benchmark.
For the training part, we use the OpenSources.co (OS), MediaBiasFactCheck.com (MBFC), and PolitiFact 6 news websites' lists. The OS list contains 560 domains, the MBFC list has 548 domains, and the PolitiFact list has 227 domains. These lists have been annotated by professional journalists. The lists contain domains of online news websites annotated based on the content type (as in the OS news list: satire, reliable, etc.; and in the PolitiFact news list: imposter, parody, fake news, etc.) or from a factuality perspective (as in the MBFC news list: low, medium, and high factuality). From the OS list, we select domains that are in one of the following categories: fake, bias, reliable, hate, satire, or conspiracy. We consider domains under the reliable category as real news sources, and the rest as fake. The PolitiFact list differs from the OS list in that it only labels domains as either fake or with mixed content. We discard the mixed ones 7 and map the remaining ones to the fake news label. Finally, we select from the MBFC list those domains that are annotated either as high or low factual news and map them to the real and fake labels, respectively. Out of these three final lists, we keep for our dataset only those domains that are annotated consistently across all lists; for example, we discard domains that are annotated as real in the OS list but labeled fake (low factuality) in the MBFC list. The final list contains 85 news websites. We then project the domain-level ground truth onto the content of those domains and randomly sample articles, with a maximum of 100 news articles per domain. 8 For the test part, we use the leadstories.com fact-checking website, where professional journalists annotated online news articles at the article level as fake or real. We do not follow the annotation procedure used for the training part, since the projection of domain-level ground truth inevitably introduces noise.
The journalists at leadstories.com assigned a set of labels to the fake news articles, e.g., false, no evidence, satire, misleading, etc.; we map them all to the fake label. In addition, we discard all articles that are multimedia-based. After collecting the news articles, we postprocess them by discarding very short articles (fewer than 30 words). The test part includes 689 fake news articles. We complement the set with a sample of 1,000 real news articles from the training part. The overall dataset consists of 5,994 real and 5,403 fake news articles. The average document length (number of words) in the MultiSourceFake dataset is 422 words, and the 95th percentile value is 942. Figure 2 shows the distribution of the documents' length in the dataset.
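The joint-agreement filtering used to build the training list can be sketched as follows; the domain names and labels below are hypothetical miniatures of the real OS, MBFC, and PolitiFact lists:

```python
# Hypothetical miniature versions of the three domain lists; the real
# OS, MBFC and PolitiFact lists contain 560, 548 and 227 domains.
os_list = {"example-reliable.com": "real", "hoax-site.net": "fake"}
mbfc_list = {"example-reliable.com": "real", "hoax-site.net": "fake",
             "only-in-mbfc.org": "fake"}
politifact_list = {"hoax-site.net": "fake"}

def consistent_domains(*lists):
    """Keep a domain only if every list that mentions it agrees on its
    label: the joint-agreement rule behind MultiSourceFake (sketched;
    the paper's exact handling of domains missing from a list may differ)."""
    merged = {}
    for lst in lists:
        for domain, label in lst.items():
            merged.setdefault(domain, set()).add(label)
    # a domain with conflicting labels accumulates >1 label and is dropped
    return {d: labels.pop() for d, labels in merged.items() if len(labels) == 1}
```

A domain annotated real in one list but fake in another is discarded, which is exactly the inconsistency case described above.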
TruthShades: This dataset was proposed in Rashkin et al. (2017). It was crawled from a set of domains annotated by professional journalists as either propaganda, hoax, satire, or real. The dataset was built from the English Gigaword corpus for real news, and from seven other unreliable domains annotated with one of the three false information labels.
PoliticalNews: Motivated by the observation that "a classifier trained using content from articles published at a given time is likely to become ineffective in the future" (Castelo et al., 2019), the authors of this work collected a dataset by crawling news websites between 2013 and 2018 in order to evaluate their model's performance across different years.
FakeNewsNet: This is a fake news repository that consists of two comprehensive datasets, one collected using claims from PolitiFact and the other from the GossipCop fact-checking website. Given the large number of true and false claims from these two fact-checking websites, Shu et al. (2018) built news datasets that contain visual and textual news article content as well as social media information obtained by searching Twitter for users who shared the news. Out of all the collected information, we use only the textual content of the news articles, which is the part we are interested in.

Experiments
Experimental setup. We split the articles' text into N segments and set the maximum length of segments to 800 words, applying zero padding to those shorter than 800 words. Concerning the FakeFlow hyper-parameters, we tune various parameters (dropout, the size of the dense layers, activation functions, CNN filter sizes and their numbers, pooling size, size of the GRU layer, and the optimization function) (see Appendix A for the search space) using early stopping on the validation set. In addition to these hyper-parameters, we also use the validation set to pick the best number of segments (N). For the MultiSourceFake dataset, we use 20% of the training part for validation. We represent words using pre-trained word2vec Google-News-300 embeddings 9 . For evaluation, we follow the setup from related work. We report accuracy and weighted precision, recall and F1 score, as well as macro F1 for datasets where the classes are imbalanced.
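One plausible reading of the segmentation step can be sketched in plain Python, assuming the article is zero-padded (or truncated) to 800 token ids and split into N equal chunks; the pad id 0 is an assumption:

```python
def segment_and_pad(tokens, n_segments=10, max_len=800):
    """Split an article into N equal segments after padding/truncating
    it to max_len token ids (a sketch of the 800-word, N-segment setup;
    0 is assumed to be the padding id)."""
    tokens = (tokens + [0] * max_len)[:max_len]   # pad or truncate to 800
    size = max_len // n_segments
    return [tokens[i * size:(i + 1) * size] for i in range(n_segments)]
```

Each resulting chunk is then fed independently through the topic and affect branches before the flow is modeled across chunks.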
Baselines. To evaluate the performance of our model, we use a combination of fake news detection models and deep neural network architectures: • CNN, LSTM: We use CNN and LSTM models and validate their performance when treating each document as one fragment. We experiment with different hyper-parameters and report results for the ones that performed best on the validation set.
• HAN: The authors of (Yang et al., 2016) proposed a Hierarchical Attention Networks (HAN) model for long document classification. The proposed model consists of two levels of attention mechanisms, i.e., word and sentence attention. The model splits each document into sentences and learns sentence representations from words.
• BERT: BERT is a text representation model that has shown superior performance on multiple natural language processing (NLP) benchmarks (Devlin et al., 2019). We use the pre-trained bert-base-uncased version, which has 12 layers and yields output embeddings of dimension 768. We feed the hidden representation of the special [CLS] token, which BERT uses to summarize the full input sequence, to a softmax layer. Experimentally, we found that fine-tuning BERT's layers gives higher performance. It is worth mentioning that BERT's input length is limited to 512 word pieces (sub-word units) (Devlin et al., 2019); thus, we discard the rest of the text in long news articles.
• Fake News Detection Models: We compare our model to several fake news detection models, including the feature-based model of Horne and Adali (2017) and the EIN model of Ghanem et al. (2020). 10
• Longformer: Given that Transformer-based models (e.g., BERT) are unable to process long sequences, we use Longformer (Beltagy et al., 2020), a state-of-the-art model for long-document tasks. In our experiments, we set the maximum sequence length to 1500 to handle documents that have more than 512 tokens in the MultiSourceFake dataset (see Figure 2). We also found that fine-tuning the Longformer model gives better results and much faster convergence.
Table 2 presents the results of our proposed model and the baselines on the MultiSourceFake dataset. Our best result was achieved with 10 segments (the value of N found on the validation data). In Figure 3 we show the model's performance for segments of different lengths. 11 In general, the results show that models based on either word n-grams or word embeddings perform better than models that use handcrafted features, e.g., Horne and Adali (2017). Also, despite the huge amount of data used to train the BERT model, the results show that BERT performs worse than FakeFlow and also fails to outperform some of the other models. We speculate that this is due to the fact that the input length in BERT is limited to 512 words, as mentioned previously, and a large portion of the news articles in the MultiSourceFake dataset is longer than 512 words. The results of the Longformer model confirm our claim regarding the documents' length and show a significantly higher F1 score than the BERT model. This emphasizes that despite the strong performance of BERT on multiple NLP benchmarks, it is unable to handle long text documents, in contrast, e.g., to vanilla text categorization (Adhikari et al., 2019). In addition, Longformer's results show a higher F1 score than the FakeFlow model, yet the difference is statistically insignificant.

Results and Analysis
To isolate the contribution of topical vs. affective information, we run two simplified versions of our architecture, each consisting only of the network that captures topical or affective information, respectively. The results show that the flow of affective information performs weakly when used alone; this emphasizes that the affective information of a news article is a meaningful, yet complementary, source of information.
Performance on Multiple Datasets. In Table 3 we compare the performance of the FakeFlow model to state-of-the-art results on the other datasets introduced in Section 4. The TruthShades dataset has two test sets, in-domain and out-of-domain: in the in-domain configuration, training and test articles come from the same sources, while in the out-of-domain configuration they come from different sources. The results demonstrate that FakeFlow achieves a better F1 on both test sets. Similarly, the results on the PoliticalNews dataset show that FakeFlow also outperforms the TopicAgnostic model, although the gap in results is not very large. Finally, regarding the FakeNewsNet dataset, the deep learning-based model (FakeNewsTracker) does not perform well compared to the other baseline proposed by the authors, a Logistic Regression (LR) classifier with one-hot vectors of the news articles' text. This suggests that a simple word-based model can work better than a more sophisticated model that incorporates social media and context information. The FakeFlow model, on the other hand, achieves a better result, outperforming both FakeNewsTracker and the LR baseline.
Topic-Aware Model. New events are constantly covered by news agencies, and these events differ from older ones in terms of discourse and topic. Therefore, a fake news detector trained on news articles from years back may be unable to detect recent news. In this experiment, we evaluate our approach on the PoliticalNews dataset, which is constructed from news distributed across different years (2013 to 2018). Following the experimental setup in (Castelo et al., 2019), we train the FakeFlow model on news from one year and test it on each of the other years. For example, we train the model on news from 2013 and test on news from 2015. Note that each test set is thus associated with 5 results, one for each training year. Figure 4 shows the average accuracy for each test set. We compare FakeFlow to the TopicAgnostic model, which proved to be effective at detecting fake news from different years. It is worth mentioning that the features of the TopicAgnostic model are extracted from both the headlines and the text of the news articles. The results show that both models have a similar performance, except for the 2013 test set, where FakeFlow achieves a higher accuracy with a difference of 7%. The experiment shows that FakeFlow is capable of detecting fake news from different years, with a flat performance across the years.
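The cross-year protocol can be sketched generically; `train_and_eval` below stands for any training routine that returns an accuracy and is not part of the original setup:

```python
def cross_year_accuracies(data_by_year, train_and_eval):
    """Train on one year, test on every other year (the setup of
    Castelo et al., 2019): each test year collects one score per
    training year, which are then averaged.

    data_by_year: {year: articles}; train_and_eval: a hypothetical
    callable (train_articles, test_articles) -> accuracy.
    """
    years = sorted(data_by_year)
    results = {y: [] for y in years}
    for train_year in years:
        for test_year in years:
            if train_year != test_year:
                results[test_year].append(
                    train_and_eval(data_by_year[train_year],
                                   data_by_year[test_year]))
    return {y: sum(a) / len(a) for y, a in results.items()}
```

With six years (2013 to 2018), every test year receives exactly five scores, matching the five results per test set mentioned above.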
Attention Weights. The proposed FakeFlow model shows that taking into account the flow of affective information in fake news is an important perspective for fake news detection. We argue that being able to better understand the behaviour of the model can make it more transparent to the end-users. Figure 5 illustrates this by showing the attention weights of a fake news article across the 10 segments (left bar). 12 The figure shows that FakeFlow attends more to the beginning of the article. For a better understanding, we match the affective information with the attention weights. Regarding the news text in the figure, the emotion features 13 show a clear example of how fake news articles try to manipulate the reader. It looks as if the presence of the fear, sadness, and surprise emotions at the beginning of the article has triggered the attention on this part. Towards the end of the article, on the other hand, we can notice that such negative emotions are absent, while emotions like joy and anticipation appear. This exemplifies how fake news tries to attract the readers' attention in the first part of the text. Regarding the morality features, we only match the word "kill" with the harm category. Also, for the hyperbolic feature, we match the words "terrifying" and "powerful". In the same manner, both the morality and hyperbolic features match words that occur at the beginning of the article. Lastly, for both the sentiment and imageability features, we are not able to find a clear interpretation in this example, where many words across the segments match.

12 We averaged the attention weight matrix along the timestep (number of segments) dimension.
13 Words with multiple colors have been annotated with multiple emotion types in the NRC lexicon.
Real vs. Fake Analysis. In Table 4 we present an analysis of both real and fake news articles. The analysis gives the reader an intuition of the distribution of the used features across the articles' segments. It shows that an emotion like fear has on average a larger difference between the first and the last segment in fake news than in real news (see Figure 6 for a visualization of the distribution). Also, a feature like hyperbolic has a higher average value and a lower standard deviation across all segments for fake news than for real news, indicating that fake news contains a consistently high amount of hyperbolic words.
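The statistics reported in Table 4 can be computed per feature as follows; `feature_matrix` is an assumed layout holding one feature's value per article and segment:

```python
import numpy as np

def segment_stats(feature_matrix):
    """Per-feature profile across segments, as in Table 4:
    first-segment mean, last-segment mean, overall mean, and the
    (averaged) standard deviation across segments.

    feature_matrix: (n_articles, n_segments) values of one feature.
    """
    return {
        "mu_first": feature_matrix[:, 0].mean(),
        "mu_last": feature_matrix[:, -1].mean(),
        "mu_all": feature_matrix.mean(),
        "sigma_all": feature_matrix.std(axis=1).mean(),
    }
```

Comparing `mu_first` against `mu_last` separately for real and fake articles reproduces the first-versus-last-segment contrast discussed above.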

Conclusion
In this paper we presented FakeFlow, a model that takes into account the flow of affective information (emotions, sentiment, hyperbolic words, etc.) in texts to better detect fake news articles. The model receives as input a text segmented into smaller units, instead of processing one long fragment. This enables it to learn the flow of affective information by modeling the interaction between the topic and the affective terms in the news article. We evaluated our model on four different datasets and compared it to several strong baselines. The extensive experiments show the effectiveness of FakeFlow over state-of-the-art models. Although FakeFlow was trained using a limited amount of text, the results demonstrated that it achieves results on par with resource-hungry models (e.g., BERT and Longformer). In future work, we plan to extend our dataset and study more fine-grained news types, e.g., propaganda, from an emotional perspective. Moreover, we plan to investigate how to replace the lexicon-based information with language-independent approaches, in an attempt to make our model multilingual.

Table 4: A quantitative analysis of the features' presence across articles' segments. We present the average value in the first segment (μ_first seg.), the average value in the last segment (μ_last seg.), the average value over all 10 segments (μ_all seg.), and the standard deviation across the 10 segments (σ_all seg.) of a feature, both in real and fake news.
For the parameter selection, we use the hyperopt 14 library, which receives the above search space and randomly samples different combinations of parameters (trials). We use a small number of trials in all of our experiments to avoid overdrawn fine-tuning, setting it to 35.
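Since hyperopt essentially samples the search space, the procedure can be approximated with plain random search; the space and objective below are toy stand-ins for the actual search space of Appendix A and the validation loss:

```python
import random

def random_search(space, objective, n_trials=35, seed=0):
    """Plain random search standing in for hyperopt's fmin: sample
    n_trials parameter combinations and keep the one with the lowest
    loss. The trial budget of 35 mirrors the setting in the text."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(options) for name, options in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

hyperopt's TPE sampler is smarter than uniform sampling, but with such a small budget the two behave similarly in practice.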

A.2 Topic Aware experiments
In Figure 4, we present the average accuracy of our model when training on different years and testing on a specific one. In the following, we show the results before averaging them.