Hidden Biases in Unreliable News Detection Datasets

Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.


Introduction
The proliferation of unreliable news is widely acknowledged (Del Vicario et al., 2016; Lazer et al., 2018; Vosoughi et al., 2018), and its identification is a socially important problem. In this work we use the label unreliable news as a broad term for all unverifiable and misleading news content, regardless of whether the content is malicious (targeted misinformation) or not. Accordingly, while specific definitions vary across the datasets used in this work, we refrain from using the term "fake" since identifying the intent of the author(s) is beyond the scope of this work. To mitigate the problem of surfacing unreliable news content, various websites (e.g., PolitiFact, Media Bias/Fact Check (MBFC), GossipCop, etc.) determine the reliability of news by manually fact-checking the important claims in given news articles. Beyond requiring investigative expertise, manual fact-checking is time-consuming and is thus limited to only a small set of selected news articles.
Recent research has explored automating this process using machine learning methods to automatically determine news veracity (Pérez-Rosas et al., 2018; Baly et al., 2018; Nie et al., 2019; Wright and Augenstein, 2020). These efforts were made possible by the availability of large-scale unreliable news detection datasets (Horne et al., 2018b; Shu et al., 2017). In our work, we examine whether these datasets accurately reflect the real difficulty of this task or whether there are hidden biases in the datasets. Specifically, we study different aspects of dataset construction (e.g., how the data was collected, how the data was split, etc.) and show that the assessed difficulty of the task is sensitive to how carefully these factors are considered when building and using these datasets.
Our investigation begins with data collection procedures: we look at the source of news stories (news outlets, social media, fact-checking websites, etc.) as well as the annotation process (number of labels, granularity of labels, article-or site-level annotation). We discuss the pros and cons of each approach and point out some hidden pitfalls. Using FakeNewsNet (Shu et al., 2017) as an example, we demonstrate how selection biases in data collection can lead to undesired biases in the created datasets.
Moving beyond data collection, we examine two commonly applied ways of splitting the dataset for training and testing that help models achieve high performance without correctly modeling the task. Specifically, we show that using a disjoint set of sites/news outlets for the training and test data significantly decreases the models' performance (>10%) and that the drop in performance is related to how similar (or dissimilar) the sites in both sets are (reflected by various site-level distributional distance metrics, including L2, cosine, MMD, etc.). Additionally, we examine the effect of time overlap between the train and test sets. We observe that different news outlets are likely to have similar content within a small time window (i.e., the same story gets covered by multiple outlets within a day or a few days). While we do not find any evidence that the studied models exploit this factor, we nevertheless suggest that future datasets be split both by time and by site/news outlet.
In summary, our main contributions are: (1) showing how data collection procedures can lead to systematic biases in unreliable news datasets, (2) demonstrating how confounding factors--such as site/news outlet and time--in these datasets can degrade their quality and lead to underestimating the difficulty of the task, and finally (3) suggesting possible mechanisms to avoid these biases and confounding factors when building new datasets. To facilitate future research, we also provide a list of practical suggestions for data collection, dataset construction, and experiment design in Table 1.

Related Work
Unreliable News Detection. Unreliable news detection and other news-veracity-related tasks have been receiving increasing attention as news sources have become more accessible in recent years. Much effort has been put into collecting high-quality datasets. Wang (2017) and Shu et al. (2017) collected manually labeled statements or news articles from fact-checking websites. The NELA datasets (Horne et al., 2018b; Nørregaard et al., 2019; Gruppi et al., 2020) scrape news articles directly from news outlets and use the manually annotated labels from Media Bias/Fact Check (MBFC) as site-level annotations. Social media is also a popular resource for collecting news stories (Nakamura et al., 2020; Santia and Williams, 2018; Mitra and Gilbert, 2015). Researchers have also collected datasets for various related topics, such as rumor detection (Kwon et al., 2017; Ma et al., 2016) and propaganda detection (Da San Martino et al., 2020; Barrón-Cedeno et al., 2019). Besides classifying the veracity of news articles, researchers have explored related problems, such as predicting the reliability of news sites (Baly et al., 2018) and identifying fact-check-worthy sentences (Wright and Augenstein, 2020), among other tasks. Several recent papers also focus on measuring the trustworthiness of single statements (Pomerleau and Rao, 2017; Alhindi et al., 2018). In this work, we focus on article-level classification because of its relevance to applications, like news feeds, that operate at the article level.

Pitfalls in Data Collection. Datasets collected through crowd-sourcing or scraping the Internet have the advantage of much better scalability compared to expert-annotated datasets. However, these automatic processes are prone to hidden pitfalls. Gururangan et al. (2018) and Poliak et al. (2018) show that crowd-sourcing "Natural Language Inference" datasets leads to various dataset biases. Similar observations have been made for "Fact Verification" datasets (Schuster et al., 2019).
Splitting data--for training, testing, and validation--is another important procedure in creating datasets that can lead to several problems. For example, Geva et al. (2019) show that models may just learn the patterns of certain annotators in a random split. Lewis et al. (2020b) demonstrate significant train-test overlap in current open-domain QA datasets. When present, these unexpected biases or overlaps can significantly undermine the utility of a dataset and lead to deceptively promising results that stem partly from artifacts and flaws in the dataset rather than from successfully modeling the intended task.
Automated Fact Checking for Statements. Automated fact checking is an important task closely related to unreliable news detection, yet it is constructed in a more controlled manner. This task focuses on strictly judging the factuality of a single statement instead of an entire article. Vlachos and Riedel (2014) first constructed a dataset of 106 claims from fact-checking websites with paired labels. FEVER (Thorne et al., 2018) is currently the largest-scale fact-verification dataset, with 185,445 claims generated by modifying sentences from Wikipedia. Both the altered claims and the ground-truth supporting evidence are included in the dataset. Existing effective approaches for fact verification include self-attention based networks (Nie et al., 2019), large-scale pretrained transformers (Soleimani et al., 2020), neural retrieval methods (Lewis et al., 2020a), and reasoning on semantic-level graphs (Zhong et al., 2020). (Dataset repositories referenced in this work: NELA at dataverse.harvard.edu/dataverse/nela, FakeNewsNet at github.com/KaiDMML/FakeNewsNet, and r/Fakeddit at github.com/entitize/Fakeddit.)

Unreliable News Datasets
Collecting high-quality datasets plays an important role in automatic unreliable news detection research. Here we review dataset collection strategies used in constructing recent datasets and point out some hidden pitfalls in these procedures.

Data Collection Strategies
Unreliable news detection is usually formalized as a classification task. Accordingly, constructing a dataset requires collecting pairs of news articles and labels.
News Articles: Each individual news outlet has its own website where news articles are published. The easiest way to collect a large number of these articles is to simply scrape these websites. Manual annotation or some other mechanism must then be incorporated in order to collect the corresponding labels for each article. Another common way to collect articles is through fact-checking websites. While this approach provides both articles and article-level labels, it normally only provides a limited set of articles. Additionally, scraping these fact-checking websites can lead to additional selection bias in the dataset, as highlighted in Section 3.2. One other recent trend is collecting posts and corresponding labels from social media (Nakamura et al., 2020; Santia and Williams, 2018; Mitra and Gilbert, 2015). While large-scale datasets can be collected through such an approach, they are often noisier than those collected through traditional news sources, due to a more casual use of language and a heavier dependency on context.

News Labels: The largest challenge in collecting these datasets lies in collecting labels. Manually checking the factuality (or reliability) and bias of a single article is time-consuming and requires non-trivial expertise. Modeling such a task through a crowd-sourcing framework is difficult. As such, current research datasets almost exclusively rely on existing resources. As discussed earlier, these resources provide either article-level or site-level labels. Article-level labels are only available through a few fact-checking websites such as PolitiFact, GossipCop, etc., but the scale is limited since generating these labels is time-consuming and costly. Site-/outlet-level labels, on the other hand, are available through websites such as MBFC, which provide manual labels for each site/outlet. These websites often assign reliable/unreliable or biased/unbiased labels to each news outlet. Many datasets for unreliable news detection assign these site-level labels to all articles from a given site. While these weak or distant labels are not always accurate (one example is shown in Table 4), they provide an easy way to create large-scale datasets. In Table 2, we highlight three recent large-scale unreliable news datasets along with their data collection procedures.

Table 4: An example showing a reliable news article from the "Daily Mail" site, which has a "Low" factual reporting rating on MBFC. Despite coming from a source with a low reliability score, the article shown is reliable and very similar to content on sites with high reliability scores (such as "BBC" and "The Week UK") from the same date.

Dataset Selection Biases
Datasets annotated without expert verification (e.g., through crowd-sourcing, automatic web scraping, etc.) can have undesired properties that undermine their quality (Gururangan et al., 2018; Poliak et al., 2018; Schuster et al., 2019). In the following analysis, we choose the FakeNewsNet dataset (Shu et al., 2017) as a representative example. We first examine the most salient features in the dataset. To achieve this, we train a Logistic Regression (LR) model on the titles of FakeNewsNet using Bag-of-Words features and show the word features with the highest weights for each class in Table 3. The features in the table show clear patterns: the top features for the reliable (positive) class are either stop words (e.g., 'at', 'the', etc.) or words presumably carrying neutral semantics (e.g., 'season', '2018', 'awards', etc.), while the top features for the unreliable (negative) class are mostly celebrity names. Using this basic model, we achieve an accuracy of ∼78%, while a BERT-based model that uses both the article and title as input achieves only an incremental improvement, yielding an accuracy of 81% (see Sec. 4.1 for detailed model descriptions). By examining the articles in the dataset, we attribute this to the selection bias exhibited by fact-checking websites: most unreliable (negative) articles have click-bait titles mentioning celebrities, while reliable sources usually have less sensational titles with fewer mentions of celebrities and more diverse keywords.
Another potential problem is the article retrieval framework. FakeNewsNet uses Google search to retrieve the original news articles (Shu et al., 2017). Internet search engines have proprietary news ranking and verification processes, which means that even when using the original title and source of a given article, the search results might prioritize specific sites over others, leading to inaccurate data collection. While Shu et al. (2017) propose several heuristics to handle these problems, it is unlikely that this noisy process is completely corrected. As a result, we find a few mismatched title-content pairs where the retrieved article cannot support the label, making the example confusing. We show one example with a questionable label in Table 5, where we suspect the inconsistency is due to the noisy retrieval step. Finally, the informal nature of user-generated content on social media may be the source of additional biases. In our preliminary experiments, we found that on the r/Fakeddit dataset, a simple Bag-of-Words (BoW) based logistic regression model can reach equal or even better performance than the reported BERT-based models (86.91% vs. 86.44% in the text-only two-way classification setting), hinting at a strong correlation between the labels and lexical inputs. This is also reflected in the equally confusing most-salient features in this dataset, shown in Table 3.
Since different collection procedures and data resources lead to different problems, there is no uniform solution for producing a completely bias-free dataset. However, one good test is to check the performance of a simple model such as a BoW-based linear model. By analyzing the features learned by the simple model and measuring the gap between the performance of a state-of-the-art system and that of the simple model, one can get a hint of the dataset's quality. Unreasonable features, together with small performance gaps, may reveal unwanted biases in the dataset. In practice, we also suggest using debiasing techniques (e.g., Schuster et al. (2019)) when developing models on biased datasets.

Dataset Split Effect
In this section, we study how time and site/outlet overlap between the training and evaluation sets act as confounding factors, and how they can impact model performance.

Baseline Models & Experimental Setup
In the following experiments, we use two models: a logistic regression baseline and a state-of-the-art large-scale pretrained Transformer-based model (RoBERTa; Liu et al. (2019)).

Logistic Regression (LR):
We use scikit-learn's (Pedregosa et al., 2011) implementation of Logistic Regression with TF-IDF-based Bag-of-Words features. We add L2 regularization with a regularization weight of 1.0 and train the model using L-BFGS. In our experiments, the LR model uses only the title (and not the article body) as input.
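A minimal sketch of this baseline configuration follows; the two training titles are placeholders. Note that in scikit-learn the regularization weight is expressed through the inverse-strength parameter C, so a weight of 1.0 corresponds to the default C=1.0.

```python
# Minimal sketch of the LR baseline: TF-IDF bag-of-words over titles,
# L2-regularized logistic regression trained with L-BFGS.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

lr_baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(penalty="l2", C=1.0, solver="lbfgs",
                               max_iter=1000)),
])

# Placeholder training data: one reliable (1) and one unreliable (0) title.
train_titles = ["a calm report on trade", "shocking secret exposed"]
train_labels = [1, 0]
lr_baseline.fit(train_titles, train_labels)
print(lr_baseline.predict(["another calm report"]))
```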
RoBERTa: Our implementation is based on the Transformers library (Wolf et al., 2019) and AllenNLP (Gardner et al., 2017). We use RoBERTa in two different ways: one takes only the title as input; the other takes both the title and the article content as input and formalizes the task as pairwise sentence classification. Specifically, we concatenate the title and the article content with a [SEP] token in the middle and use different token type embeddings to differentiate between the title and the content. Articles are truncated to fit the 512-token length limit. In the title-only setting, the batch size is set to 32, the learning rate to 5e-5, and the model is trained for 3 epochs. In the article+title setting, the batch size is set to 8, the learning rate to 2e-5, and the model is trained for 10 epochs. These hyperparameters were set empirically, and our preliminary experiments show that the results are not sensitive to these choices.
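The title+article input construction can be sketched abstractly as follows. The special-token ids and the helper name are illustrative placeholders, not RoBERTa's actual vocabulary ids; in practice, a real tokenizer handles this via its pair-encoding API.

```python
# Sketch of pairwise input construction: title and article token ids are
# joined with a separator and truncated to the 512-token model limit.
# CLS_ID and SEP_ID are placeholders, not RoBERTa's real special-token ids.
CLS_ID, SEP_ID, MAX_LEN = 0, 2, 512

def build_pair_input(title_ids, article_ids, max_len=MAX_LEN):
    """Return (input_ids, token_type_ids) for a title/article pair.

    Layout: [CLS] title [SEP] article [SEP], with the article truncated
    so the whole sequence fits in max_len tokens.
    """
    # Room left for the article after [CLS], the title, and two [SEP]s.
    budget = max_len - len(title_ids) - 3
    article_ids = article_ids[:max(budget, 0)]
    input_ids = [CLS_ID] + title_ids + [SEP_ID] + article_ids + [SEP_ID]
    # Token type 0 for the title segment, 1 for the article segment.
    token_type_ids = [0] * (len(title_ids) + 2) + [1] * (len(article_ids) + 1)
    return input_ids, token_type_ids

# A 10-token title with a 1000-token article gets truncated to 512 total.
ids, types = build_pair_input(list(range(10, 20)), list(range(100, 1100)))
print(len(ids), len(types))
```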

Datasets:
Here, our analysis focuses on the 2018 version of the NELA dataset (Horne et al., 2018b). Unlike FakeNewsNet, NELA gathers news directly from news outlets, so the influence of selection bias is insignificant. We thus focus our analysis on other potentially confounding factors in the dataset. We use the latest aggregated site-level labels provided in NELA-GT-2019 (Gruppi et al., 2020) and report both article- and site-level accuracy. For article-level accuracy, we assign the site-level label to all articles from that news outlet and calculate per-article accuracy. For the Source (Site) Split setting (with no overlap between training and evaluation sites), we also report site-level accuracy: we aggregate the predictions over individual articles for a given outlet and use the majority prediction as the site-level prediction. We use a balanced label distribution for all dataset splits. The results in the third column of Table 6 show the models' performance on the random split, which is the default split method used in most papers (e.g., Nakamura et al., 2020; Horne et al., 2018a). As the results show, even the simplest logistic regression model achieves an accuracy of over 77%, whereas the RoBERTa model using both the title and the news article as input reaches almost 97% accuracy.
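The site-level aggregation described above amounts to a majority vote over article-level predictions; a sketch (the function and site names are illustrative):

```python
# Sketch of site-level aggregation: collect per-article predictions for
# each outlet and take the majority vote as the site-level prediction.
from collections import Counter, defaultdict

def site_level_predictions(article_preds):
    """article_preds: iterable of (site, predicted_label) pairs."""
    by_site = defaultdict(list)
    for site, pred in article_preds:
        by_site[site].append(pred)
    # most_common(1) yields the majority label; ties break arbitrarily.
    return {site: Counter(preds).most_common(1)[0][0]
            for site, preds in by_site.items()}

preds = [("siteA", 1), ("siteA", 1), ("siteA", 0),
         ("siteB", 0), ("siteB", 0)]
print(site_level_predictions(preds))  # {'siteA': 1, 'siteB': 0}
```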

Effect of Split by Source
For this experiment, instead of using the standard random split of all the news articles in the dataset, we first randomly split all the sites in the dataset into three disjoint sets (train/dev/test) before adding all articles from each site to their assigned set. We believe this setup is closer to real-world tasks. For instance, in order to block all unreliable news sources, one simple yet useful approach is to maintain a list of questionable sources; all news from those sources is then automatically blocked. In this setting, the only remaining task is classifying sources with no or very few annotated examples. As the results in Table 6 show, there is a significant drop in performance for all the models when compared to the random split. The logistic regression model's performance drops from 77.5% to 67.2%, and even the more powerful RoBERTa model with both title and article as input drops from 96.9% to 80.4%, demonstrating the task's significantly increased difficulty. While aggregating article-level results to the site level can significantly improve accuracy, we also see a plateauing trend where adding the article as additional input brings no further improvement to the RoBERTa model. Since we subsample the original dataset and balance the number of news articles for each label, the majority baseline (at the article level) is always 50%. But the site-level majority baseline is well above random (69.29%). While a new 50% majority baseline could be achieved by re-subsampling the dataset, the current number also indicates a severe imbalance in dataset size between reliable and unreliable sites, which can potentially be exploited by the models.
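The source split can be sketched as follows; the seeded shuffle and the split fractions here are illustrative choices, not necessarily those used in the experiments above.

```python
# Sketch of a source (site) split: sites are partitioned into disjoint
# train/dev/test sets first, and every article then follows its site.
import random

def split_by_site(articles, seed=0, dev_frac=0.1, test_frac=0.1):
    """articles: list of (site, article) pairs. Returns three lists."""
    sites = sorted({site for site, _ in articles})
    random.Random(seed).shuffle(sites)
    n_dev = int(len(sites) * dev_frac)
    n_test = int(len(sites) * test_frac)
    dev_sites = set(sites[:n_dev])
    test_sites = set(sites[n_dev:n_dev + n_test])
    train, dev, test = [], [], []
    for site, art in articles:
        (dev if site in dev_sites else
         test if site in test_sites else train).append((site, art))
    return train, dev, test

# 100 synthetic articles spread over 10 sites.
arts = [(f"site{i % 10}", f"doc{i}") for i in range(100)]
train, dev, test = split_by_site(arts)
# No site appears in more than one split.
assert not ({s for s, _ in train} & {s for s, _ in test})
```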
Random Label Experiments: For this experiment, we use the original random split strategy, but we permute all the site-level labels randomly. Hence each label no longer represents the reliability of the site and is just an arbitrary feature of the site itself. Therefore, the only way for the models to achieve good performance on this task is to memorize the arbitrary site-label mapping. The results in Table 7 show that the models achieve very high accuracy, with the more powerful RoBERTa model with both title and article showing only ~2% accuracy loss when compared to the true labels. These results demonstrate the models' ability to memorize random site labels, and the similarity between these results and the results on the random splits suggests that the models are bypassing the real task of reliable/unreliable news classification and are simply memorizing the site identities.

Performance Variance and Site Similarity Analysis: Another interesting observation from the results in Table 6 is that while the performance on every random split is fairly stable, the performance is much more unstable when splitting by source. For example, the RoBERTa (Title+Article) model results have a standard deviation larger than 10 points, with the highest accuracy reaching over 90% and the lowest below 60%. One potential factor behind the varying performance is the heterogeneity of different news sources (sites). News sites that are similar to those in the training set could be much easier to classify than sites with completely different styles or content. In this case, even when splitting by site, correlations between the content of similar sites in the training and evaluation sets may drive the generalization performance. To assess this hypothesis, we measure the dependence between the distances of sites in the training and evaluation sets and the site-level model performance on the evaluation set.
Given a site s in the evaluation set, we measure its similarity to all the sites t ∈ S_train in the training set. Below we show that higher accuracy on the site s is associated with higher similarity between s and the training sites with the same label, t ∈ S_same, providing evidence in favor of our hypothesis.
In order to measure the similarity between different sites, we take the representation learned by the RoBERTa model as the representation of an article, with a focus on its reliability. Since the RoBERTa model feeds the whole input sequence into the multi-layer transformer architecture and passes the representation of the [CLS] token to the downstream classifier (Devlin et al., 2019; Liu et al., 2019), we use the same [CLS] representation as the representation for the whole title+article input.
For similarity metrics between sites, we follow Guo et al. (2020) and calculate the L2 distance, cosine distance, MMD (maximum mean discrepancy) distance (Gretton et al., 2012; Li et al., 2015), and the CORAL (correlation alignment) distance. Following Guo et al. (2020), the L2 and cosine distances are calculated by first averaging all the example representations to get the site representation and then calculating the distance between site representations; the MMD distance is calculated using an unbiased finite-sample estimate from Li et al. (2015); and the CORAL distance is calculated as ||C_s - C_t||_F^2 / (4d^2), where d is the feature dimension, C_s and C_t are the covariance matrices of the two sets, and ||.||_F^2 is the squared matrix Frobenius norm. To simplify our analysis, we filter out all the sites containing fewer than 100 examples (assuming the articles from these sites are too few to significantly influence the model). For every site s in the evaluation set, we calculate its distance to every site t in the training set, and then compare its minimum distance to the subset of training sites with the same gold label, S_same, against its minimum distance to the subset with the opposite label, S_oppo; the ratio of the two serves as a similarity score. We compute this score using all four distances above for the top and bottom 10 sites in the evaluation datasets (ranked by their accuracy with RoBERTa) and report the mean over all the sites and over all five different random splits in Table 8. The top 10 sites always have a much larger similarity score than the bottom 10 sites, indicating that they have a much larger similarity with training sites that share their label. This trend holds across all of the distance metrics. The sensitivity of performance to site similarity raises additional concerns about how the results in Table 6 may generalize in real life. As newly emerged unreliable sites are likely to behave differently from old sites, the model's performance may be on the lower end of the variance.
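The four distances can be sketched with numpy as follows. The RBF bandwidth for MMD is an arbitrary illustrative choice, and the synthetic Gaussian representations stand in for the actual [CLS] vectors.

```python
# Sketch of the four site-level distances between two sets of article
# representations (rows = articles, columns = feature dimensions).
import numpy as np

def l2_distance(A, B):
    # Distance between mean (site-level) representations.
    return float(np.linalg.norm(A.mean(0) - B.mean(0)))

def cosine_distance(A, B):
    a, b = A.mean(0), B.mean(0)
    return float(1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmd_distance(A, B, gamma=1.0):
    # Unbiased finite-sample MMD^2 estimate with an RBF kernel.
    def k(X, Y):
        d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    n, m = len(A), len(B)
    kaa = (k(A, A).sum() - n) / (n * (n - 1))  # drop the diagonal (k=1)
    kbb = (k(B, B).sum() - m) / (m * (m - 1))
    kab = k(A, B).mean()
    return float(kaa + kbb - 2 * kab)

def coral_distance(A, B):
    # ||C_s - C_t||_F^2 / (4 d^2), with C the feature covariance matrix.
    d = A.shape[1]
    cs, ct = np.cov(A, rowvar=False), np.cov(B, rowvar=False)
    return float(np.linalg.norm(cs - ct, "fro") ** 2 / (4 * d * d))

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))            # one "site"
B = rng.normal(loc=2.0, size=(50, 8))   # a site with shifted content
print(l2_distance(A, B), coral_distance(A, B))
```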
As a natural extension, we also explored building a model that directly optimizes these site-level distance metrics in order to achieve better site-level generalization. However, in our preliminary results, this model does not show significant improvement over the baseline models. This also hints that it is very difficult for these models to extract features that are useful for the task of reliable/unreliable news classification itself, and that they instead learn site-specific features.

Effect of Split By Time
Another potentially important factor to consider while creating train/test/dev splits for a news-based dataset is time. As news-worthy events happen every day, multiple news articles from different outlets can report the same event. For example, in the NELA 2018 dataset (Nørregaard et al., 2019), within a period of two days (from 2018/10/01 to 2018/10/02), there are more than 100 news articles from over 60 sources about the US-Canada-Mexico trade accord. Therefore, by remembering the content of the event from one article, the model can easily predict the label for any related news article.
To test the effect of time, we examine the models' performance on news articles from a temporally disjoint dataset. Specifically, since all our models are trained on NELA-GT-2018 (Nørregaard et al., 2019), we use NELA-GT-2019 (Gruppi et al., 2020) as the evaluation dataset. We split the news articles in 2019 into twelve months and plot the performance trend in Figure 1. Unlike the significant performance drop in the source split experiments, we do not observe a clear correlation between performance and the length of the time gap. Therefore, at least for the current models and datasets, splitting by time does not significantly influence the results. This finding may result from the fact that the model is not memorizing the exact events in the training set (which is not limited to the unreliable news domain), or it could be attributed to noise in the training set (similar events can be reported by both reliable and unreliable sources). However, we do have to point out that this observation only holds for the current models, and it is possible for more powerful models to memorize all events. In addition, the widest time gap tested here is still within a couple of years, which is a relatively short time in terms of news events. A longer time gap (or a major event such as COVID-19) may lead to different model behavior. So in practice, we nonetheless suggest splitting datasets by time to avoid these issues.
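The per-month evaluation above can be sketched as bucketing evaluation articles by publication month and scoring each bucket; the field layout and example dates are illustrative.

```python
# Sketch of the split-by-time analysis: group evaluation articles by
# month and compute accuracy per month to look for a time-gap trend.
from collections import defaultdict
from datetime import date

def accuracy_by_month(examples):
    """examples: iterable of (published: date, gold, pred) triples."""
    buckets = defaultdict(lambda: [0, 0])  # (year, month) -> [hits, total]
    for published, gold, pred in examples:
        key = (published.year, published.month)
        buckets[key][0] += int(gold == pred)
        buckets[key][1] += 1
    return {key: hits / total
            for key, (hits, total) in sorted(buckets.items())}

examples = [
    (date(2019, 1, 5), 1, 1),
    (date(2019, 1, 9), 0, 1),
    (date(2019, 2, 2), 0, 0),
]
print(accuracy_by_month(examples))  # {(2019, 1): 0.5, (2019, 2): 1.0}
```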

Error Analysis
Here, we conduct an error analysis to see how the model performs with respect to the variation of some other factors of practical interest, such as topic and site size.
The Influence of Topic in Article-Level Prediction: In order to gain better insight into the performance drop in the source split experiments, we perform a deeper investigation of the numbers in Table 6. We first check whether the models show different performance on different topics. To get a high-level understanding of what the topics are, we look at the titles of articles in the evaluation set and calculate the words with the highest PMI with the correctness of the RoBERTa model's predictions. We then use these PMI values as weights and plot the word clouds in Figure 2. In the word cloud of correct predictions, we observe many words related to sports events, while words in the incorrect-predictions cloud mostly appear in political news. This is not surprising, since there is much more incentive to interfere with political news than with sports news, making the need for more robust models even more pressing for real-world applications.
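The PMI analysis can be sketched as follows. The add-one smoothing and exact counting scheme are our assumptions (the analysis above does not specify them), and the toy titles are invented.

```python
# Sketch of the PMI analysis: score each title word by its pointwise
# mutual information with correct predictions, with add-one smoothing.
import math
from collections import Counter

def pmi_with_correctness(titles, correct):
    """titles: list of token lists; correct: parallel list of booleans.
    Returns {word: PMI(word, correct)} with add-one smoothing."""
    word_counts = Counter()
    word_correct = Counter()
    for tokens, ok in zip(titles, correct):
        for w in set(tokens):  # count each word once per title
            word_counts[w] += 1
            word_correct[w] += int(ok)
    n = len(titles)
    p_correct = sum(correct) / n
    scores = {}
    for w, c in word_counts.items():
        # P(correct | w) with add-one smoothing, against P(correct):
        # PMI(w, correct) = log P(correct | w) / P(correct).
        p_correct_given_w = (word_correct[w] + 1) / (c + 2)
        scores[w] = math.log(p_correct_given_w / p_correct)
    return scores

titles = [["nba", "finals"], ["senate", "vote"], ["nba", "draft"]]
correct = [True, False, True]
scores = pmi_with_correctness(titles, correct)
print(max(scores, key=scores.get))  # sports words score highest here
```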
The Influence of Size in Site-Level Prediction: Finally, we examine the effect of aggregating predictions from the article level to the site level. Unlike in current datasets, where most sites have hundreds or even thousands of articles, a newly emerged news outlet waiting for classification may have only a very limited number of articles. Accordingly, while Table 6 shows a general improvement from aggregation, it is also important to check the aggregation effect when the number of articles for a given site is small.
In Figure 3 we plot the performance of 5 different runs of the RoBERTa (Title+Article) model against the number of articles on a given site. We can see that the performance is worse when the size of the site is less than 100, demonstrating the difficulty of predicting the reliability of a site given limited data. It is also surprising to see a significant number of errors even when the site size is over 1000. This indicates the limitation of simply aggregating article-level predictions to the site level at test time.
Capturing the article-site hierarchy in a better way is a potential future research direction.

Conclusion
In this paper, we took a closer look at current large-scale unreliable news detection datasets. We studied their collection procedures and dataset split strategies and pointed out important flaws in the current approaches. Specifically, we demonstrated that selection bias in dataset collection often leads to undesired and significant artifacts in these datasets, and we highlighted confounding factors (e.g., article source, time) in news datasets that can lead to underestimating the difficulty of the task. Finally, we provided suggestions on how to better create and process such datasets in the future. We hope our work leads to more high-quality news datasets and inspires further work in this direction.