LUX (Linguistic aspects Under eXamination): Discourse Analysis for Automatic Fake News Classification

The democratization and decentralization of both the production and consumption of information has resulted in a subjective and often misleading depiction of facts known as Fake News, a phenomenon that is effectively shaping the perception of reality for many individuals. Manual fact-checking is time-consuming and cannot scale, and although automatic fact-checking via machine learning holds promise, it is significantly hindered by a deficit of suitable training data. We present a novel dataset, VERITAS (VERIfying Textual Aspects), a collection of fact-checked claims together with their original documents, and LUX (Language Under eXamination), a text classifier that makes use of an extensive linguistic analysis to infer the likelihood of the input being a piece of fake news.


Introduction
Often defined as the intentional or unintentional spread of false information (K et al., 2019), Fake News has found fertile ground in the current scenario of ever-growing data consumption and generation, where factors like news source decentralization, citizen journalism, democratization of media and astroturfing (the practice of masking the sponsors of a message or organization to make it appear as though it originates from and is supported by grassroots participants; Lee, 2010) make the task of manually checking and correcting disinformation across the internet impractical if not infeasible (Shao et al., 2016), despite the significant efforts of Fact-Checking Agencies: organised groups of journalists that manually identify and investigate rumours conveyed by fake news articles.
Consequently, it is imperative that we develop an efficient and reliable way to account for the veracity of what is produced and spread as information; this process is known as automatic fact-checking (Hassan et al., 2015).
Although there has been significant research effort to tackle the task of automatic fact-checking (Azevedo, 2018), the deficit of datasets containing organic news articles, in their entirety, which have been manually labeled with respect to their veracity is a common obstacle for the development of supervised classification models. The absence of such datasets makes researchers rely on other approaches, e.g., stance determination (Popat et al., 2017), knowledge base matching (Wu et al., 2014), trust assessment of sources (Balakrishnan and Kambhampati, 2011), data structuring (Conroy et al., 2015), network pattern analysis (Shao et al., 2016), etc.
In this work we present the challenges faced in the process of developing a language model enriched by discourse features for fake news detection, along with experimental results. The contributions of this work are mainly two: the dataset creation process, described in Section 2, and the introduction of the text classification model, LUX (Language Under eXamination), in Section 3.
Section 4 presents a comprehensive evaluation of both VERITAS and LUX, while also featuring an ablation analysis of the latter.

Available Corpora on Fake News
The deficit of suitable corpora for the intended approach is the main motivation behind the creation of the VERITAS Dataset and, by consequence, the VERITAS Annotator. Below we present a list of datasets commonly used in related tasks. Note that although the following are considered valuable resources for many related tasks, none of them includes all three of the most important characteristics required for a content-based supervised classifier, which are: i) a significant volume of entries, ii) gold standard labels and iii) the entire fake news articles (i.e., the origin).
Emergent16 is a collection of 300 rumours and 2,595 associated news articles, a counterpart to 'origin' in the VERITAS Dataset. Each claim's veracity is estimated by journalists after they have judged that enough evidence has been collected (Ferreira and Vlachos, 2016). Besides the claim labeling, each associated article is summarized into a headline and also labelled according to its stance towards the claim. Given the fixed structure of the website, we were able to obtain valid labeled examples using a scraper (while web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler). Unfortunately, these sum up to less than 100 usable claim-origin pairs (discussed in subsection 2.3).
LIAR17 includes around 13K human-labeled short statements which are rated by the fact-checking website PolitiFact as "pants on fire", "false", "barely true", "half true", "mostly true", or "true" (Wang, 2017). The domain-restricted data as well as the reduced length of text that can be retrieved from this corpus make it unsuitable for generic-domain linguistic fake news detection.
FakeNewsNet18 is a data repository containing a collection of around 22K real and fake news items obtained from the PolitiFact and GossipCop fact-checking websites. Each row contains an ID, URL, title, and a list of tweets that shared the URL. It also includes linguistic, visual, social, and spatiotemporal context regarding the articles. This repository could still be used for supervised learning models if it were not for the fact that it does not provide sufficiently long texts to be used by a classifier based on linguistic aspects. For the same reason, CREDBANK (Mitra and Gilbert, 2015) and PHEME (Derczynski and Bontcheva, 2014) are also unsuitable for the authors' use case. These three datasets focus on network indicators (e.g. number of retweets, sharing patterns, etc.) of fake news, instead of its contents. CREDBANK is a crowd-sourced corpus of "more than 60 million tweets grouped into 1,049 real-world events, each annotated by 30 human annotators", while PHEME includes Twitter conversation threads around rumours that emerged during breaking news events.
FEVER18 (Thorne et al., 2018) is a set of more than 185K claims created by modifying sentences from a collection of 50K Wikipedia articles. Annotators were tasked with annotating other sentences from the same article with respect to their stance towards the modified sentence. The corpus is the largest to our knowledge, but since it is synthetically created and focused on a sentence-level stance classification approach, it is unlikely to perform efficiently on heterogeneous web documents as a fake news classifier.
Snopes19 (Hanselowski et al., 2019) provides a large collection of more than 16 thousand manually annotated text snippets extracted from 6,422 snopes.com articles. Unfortunately, less than half of those snippets present a stance (agreeing or disagreeing) towards the fact-checked claim. Also, the annotated snippets are, by definition, only a portion of the original article. Nevertheless, an origin identification process could generate a significant amount of valid examples from this data.
Due to space restrictions, we cannot provide a detailed description of the following datasets, although it is important to mention them: BuzzFeed16 (Potthast et al., 2018), Kaggle and NELA17 (Horne et al., 2018).

The VERITAS Dataset
The VERITAS Dataset is, to our knowledge, the most complete data collection of manually annotated claims with regard to their veracity. It is the only dataset to contain not only the mentioned veracity labels but also the document (in its entirety) from which the checked claim originated. VERITAS has been developed in a two-step process: 1) scraping of fact-checking articles and 2) claim origin identification.
Step 1: Scraping FCAs As the cost of manually checking a large number of disputed claims is extensive, both in time and money, we started the dataset creation process by scraping articles from fact-checking agencies, thereby trusting the work done by their journalists, who undertake the processes of: 1) selecting controversial claims, 2) leveraging web documents that either support or deny those statements to 3) finally come to a veracity verdict. In simple terms, a Fact-Checking Article (FCA) is a narrative of this investigative process.
For each scraped FCA, we create an entry in the dataset and extract a number of attributes, most importantly: the claim, the veracity label, and the list of hyperlinks to the mentioned web documents, which we call Origin Candidates, since they will be the subject of the Origin Identification process. The code used to scrape the pages is openly available at https://github.com/lucas0/VeritasCrawler.
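To illustrate this scraping step, the sketch below extracts the claim, verdict and origin candidates from a single FCA page with requests and BeautifulSoup. The example URL and CSS selectors are hypothetical placeholders: each fact-checking agency requires its own parsing rules, and the actual rules live in the repository above.
```python
# Hypothetical sketch of scraping a single fact-checking article (FCA).
# The URL and CSS selectors are placeholders; each agency needs custom rules.
import requests
from bs4 import BeautifulSoup

def scrape_fca(url):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Claim and verdict: selectors depend on the agency's page template.
    claim = soup.select_one(".claim-text").get_text(strip=True)
    label = soup.select_one(".verdict").get_text(strip=True)

    # Origin candidates: every external hyperlink cited in the article body.
    body = soup.select_one("article")
    candidates = [a["href"] for a in body.find_all("a", href=True)
                  if a["href"].startswith("http")]

    return {"claim": claim, "label": label, "origin_candidates": candidates}

# entry = scrape_fca("https://example-factchecker.org/some-claim")
```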
Step 2: Claim Origin Identification One of the most important steps of the dataset creation pipeline was a task we defined as "origin identification". In short, after three automatic ways of identifying the article in which a fact-checked claim originated were tried and yielded unsatisfactory results, it was decided that a manual annotation process would be used to select the correct entries from the totality of the dataset. An annotation tool (https://github.com/lucas0/Annotator) was developed in order to make the task easier and faster. This annotation process not only provided a large and complete version of the dataset, but also leaves open the possibility of automating the origin identification process as a future improvement of the project.
The final structure of each entry contains the following fields: Fact-Checking Article URL, Checked Claim, Claim Label, Tags, FCA Date, Origin URL, Origin Domain, Origin Body, Origin Title, Origin Summary, Origin Keywords, Origin Date and Origin Author. Given the limited space, a more in-depth description of each field is provided in the supplementary material (appendix 1), along with an extensive description of the origin annotation process in (Azevedo and Moustafa, 2019). The past versions of the dataset are also openly available at https://github.com/lucas0/VeritasCorpus.

Consolidation of VERITAS Dataset
A consolidation of the VERITAS dataset followed the large annotation process over the scraped FCA pages, which augmented both the quantity of annotated origins (1,032 consolidated origins from more than 10k annotations) and the quality of the annotations, measured by Krippendorff's Alpha, reaching a substantial score of 0.6014. This consolidation generated the fourth version of the dataset, here addressed as V4.
Given the constant structure of Emergent.info articles, we have also incorporated its few valid claims, i.e., the ones with "true" or "false" verdict, and their respective sources.
Although the majority of origins obtained from Emergent were linked to "true" claims, when aggregated with the consolidated origins from VERITAS v4.0, the data collection showed a false/true class imbalance ratio of ≈ 1.44. Therefore, in order to quickly obtain "true"-labeled news articles to balance the scraped dataset, reporting articles were scraped, automatically labeled as "true" and composed into a separate dataset where their headlines are used for the claim field. The sources of those articles were selected according to studies determining the least biased (https://www.businessinsider.com/most-biased-news-outlets-in-america-cnn-fox-nytimes-2018-8) and/or most trusted (businessinsider.com/most-and-least-trusted-news-outlets-in-america-cnn-fox-news-new-york-times-2019-4) news outlets in the U.S.
We are aware that the label assumption for those articles is far from ideal. Notwithstanding, it offers a palliative solution for the label imbalance issue and yielded positive results in similar works (Horne and Adali, 2017; Ireland, 2018). It should, however, be tested with caution and compared with other, also sub-optimal, methods, e.g., discarding "false" entries and/or implementing class weights in the model training. Both the collection of reporting articles and the Emergent articles are provided separately so they can be optionally disregarded and eventually substituted by gold-standard data. Table 1 provides additional details about each subset.
Since the improvement from incorporating the entries from Emergent was still to be evaluated by the proposed classifier, two different sample sets from the trusted sources were created, to balance both the v4.0 dataset by itself (V4+T1) and the concatenation of VERITAS and Emergent (V4+EM+T2). The evaluation results are presented in Section 4, as they also serve as the evaluation of the linguistic model. By comparing both balanced sets we can gain a better understanding of the impact of adding the Emergent entries.

LUX - Language Under eXamination
The core contribution of this work is the investigation of the use of linguistic aspects as discriminative features in a text classification model that determines whether a given article is fake or not. We call this classifier LUX, short for Language Under eXamination. Previous work investigated the use of such linguistic aspects as features for similar tasks such as deception detection (Reichel and Lendvai, 2016; Zhou et al., 2004), document clustering (Yu and Hatzivassiloglou, 2003a) and text classification (Louis and Nenkova, 2011; Biyani et al., 2016), among others. Related works make use of only a few (often just one) of those aspects, and the majority of them report an improvement of their results by doing so.
Here we present a set of linguistic aspects that have been shown to be correlated with deception. For each of these aspects, we present its contextual definition, along with a short literature review and a description of the methods we use to evaluate its presence or absence in a given piece of text. The objective is to build LUX (Language Under eXamination), a fake news classifier that effectively uses these linguistic aspects to estimate the likelihood of an article containing fake news. We also present the results obtained with two baseline language models (BERT (Devlin et al., 2018) and Word2Vec (W2V) (Mikolov et al., 2013)) towards building this classifier.
We are aware of an inherent redundancy that our features might present, since the aspects analyzed by the different approaches, in some cases, overlap with each other, but we expect that the eventual bias this redundancy might add to the model can be overcome with techniques such as Linear Discriminant Analysis (LDA) or Principal Component Analysis (PCA).
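As a rough illustration of how such redundancy could be reduced, the sketch below applies scikit-learn's PCA to a matrix of linguistic features before classification; the random feature matrix, the scaling step and the target dimensionality of 30 are illustrative assumptions, not the settings used in LUX.
```python
# Sketch: decorrelating the linguistic features with PCA before classification.
# X stands in for an (n_documents x 97) matrix of linguistic feature values.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 97)                    # placeholder for the real feature matrix

X_scaled = StandardScaler().fit_transform(X)   # features live on very different scales
pca = PCA(n_components=30)                     # illustrative target dimensionality
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_.sum())     # variance retained by the projection
```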

Linguistic Aspects
Subjectivity Louis and Nenkova (Louis and Nenkova, 2011) observed that general sentences tend to be more subjective. Some of the shallow features that are correlated to the subjectivity level of a sentence are also used in their model, for example, punctuation marks, average number of characters and average number of words.
Pattern, a python library for text analysis, states in its section about subjectivity: "Written texts can be broadly categorized into two types: facts and opinions." Based on a lexicon of adjectives produced for product review analysis, pattern.en provides a function that maps the subjectivity score of a sentence to a range between 0 and 1, depending on the number of adjectives it contains. It also provides implementations of measuring functions for mood and polarity.
Riloff and Wiebe (Riloff et al., 2003) present a methodology for the creation of the MPQA Subjectivity Lexicon. In summary, the authors: 1) use an automatic subjectivity classifier to label data while also 2) identifying patterns present in the sentences labeled as subjective and 3) use the learned patterns to improve the classification model (1), iterating between the three steps and thus making bootstrapping possible. We also use the MPQA Lexicon to measure the subjectivity of a given text. Based on the lexicon, (Wilson et al., 2005) also created OpinionFinder, a subjectivity classifier.
Another interesting method was presented by (Yu and Hatzivassiloglou, 2003a), where a Naive Bayes classifier is trained on a Wall Street Journal dataset containing two classes: Subjective (every article of type Editorial or Letter to the Editor) and Objective (Business or News). By analysing low-level features of the texts, the NB classifier achieved 0.91 recall and 0.86 precision on the binary classification task.
In order to measure the subjectivity of a text, two values are calculated. Both are a sum of each word's subjectivity score normalized by the length of the document (in words), but they use different reference lexicons: the TextBlob lexicon (a python library based on Pattern) and the MPQA lexicon, described above.
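A minimal sketch of these two normalized scores is given below; the per-word TextBlob lookup and the mpqa_lexicon dictionary (assumed to map words to subjectivity scores loaded beforehand from the MPQA clues file) are illustrative assumptions rather than the exact implementation.
```python
# Sketch: two document-level subjectivity scores, both normalized by word count.
from textblob import TextBlob

def textblob_subjectivity(text):
    words = text.split()
    # Per-word lookup: words absent from the lexicon contribute 0.
    total = sum(TextBlob(w).sentiment.subjectivity for w in words)
    return total / max(len(words), 1)

def mpqa_subjectivity(text, mpqa_lexicon):
    # mpqa_lexicon: hypothetical dict {word: score}, e.g. strongsubj=1.0,
    # weaksubj=0.5, built beforehand from the MPQA subjectivity clues file.
    words = [w.lower() for w in text.split()]
    total = sum(mpqa_lexicon.get(w, 0.0) for w in words)
    return total / max(len(words), 1)
```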
Specificity Speciteller (Li and Nenkova, 2015) is a sentence specificity predictor whose features we reuse. Besides shallow lexical counts, it uses discourse connectives from the Penn Discourse Treebank (Prasad et al., 2008). From lexicons (General Inquirer (Stone et al., 1962), MRC (Wilson, 1988) and MPQA), other features like sentiment, subjectivity, polarity, familiarity, concreteness, imageability and meaningfulness are also evaluated. As non-sparse features, Brown clusters (Brown et al., 1992) are used to classify words into 100 groups, and a vector of corresponding cardinality keeps track of the frequency of each class in the input text. Speciteller also uses averaged word embeddings to represent a sentence embedding; these are 100-dimensional vectors provided by (Turian et al., 2010).

The ablation results show that Speciteller contributes significantly to the LUX classifier and suggest that the framework could be even more impactful if contemporary word embedding generation techniques were used.
Complexity (Biyani et al., 2016) focused on the detection of clickbait (which can be seen as a subcategory of fake news) and reported that features used to measure the formality of a text were the most correlated with clickbait articles. Using a slang lexicon and a list of bad words, as well as several readability scores, they obtained a reasonable F1 score of 74.9.
A 1999 paper by (Heylighen and Dewaele, 1999) presents a famous metric for Formality evaluation, named the F-measure (not to be confused with the F1 score). (Pavlick and Tetreault, 2016) present a statistical model for predicting formality, but do not provide access to the model's code.
Another well-known work in the formality area is Coh-Metrix (Graesser et al., 2014), but the only access to its implementation is through a simple HTML portal, so we discarded this option.
Fortunately, the readability python library provides several readability measuring tools, including known metrics such as Flesch-Kincaid (Kincaid et al., 1975), Coleman-Liau (Coleman and Liau, 1975), LIX (Björnsson, 1968) and RIX, which were also used by (Biyani et al., 2016). Those last two metrics are simple but effective: LIX is calculated as W/S + (C/W) * 100, where W is the number of words in a text, S is the number of sentences and C is the number of complex words (words with more than 6 letters). The RIX metric is a simpler, graded version of LIX and is calculated as C/S.
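A minimal sketch of these two formulas, with naive sentence and word splitting used purely for illustration:
```python
# Sketch of the LIX and RIX readability formulas described above.
import re

def lix_rix(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    long_words = [w for w in words if len(w) > 6]   # "complex" words: more than 6 letters
    W, S, C = len(words), max(len(sentences), 1), len(long_words)
    lix = W / S + (C / max(W, 1)) * 100
    rix = C / S
    return lix, rix

# lix, rix = lix_rix("Some reasonably long article text goes here. Another sentence.")
```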
Another python library, pySemCom, initially developed for the AFEL project (d'Aquin et al., 2018), provides further tools for semantic complexity analysis. The library starts by identifying the entities present in the input text and the relations between them in order to represent the text as a knowledge graph, which is then used to extract metrics such as number of nodes, radius, assortativity and other graph properties.
We use both the readability and pySemCom libraries to obtain the largest possible set of unique metrics for complexity, formality and readability.
Uncertainty According to (Szarvas et al., 2012), "Uncertainty can be interpreted as lack of information: The receiver of the information cannot be certain about some pieces of information".
Rubin (Rubin et al., 2006) provides a solid survey on certainty identification. Building on that, (Vincze, 2015) elaborates on the same subject and achieves strong results (Vincze et al., 2008) on the CoNLL-2010 Shared Task, which aimed at the classification of uncertain texts from the BioScope corpus. The approach was conveniently implemented as a python library for uncertainty detection, which we use for uncertainty measurement. The classifier is a simple model trained on a corpus of words that were assigned a binary label regarding their certainty. The model only requires the input text to be POS-tagged in order to resolve syntactic ambiguity.
(Reichel and Lendvai, 2016) tried to identify hoax-resolving tweets by using the ratios of four data-augmented lexicons (knowledge, report, belief, and doubt) as features, along with low-level syntactic features, without achieving good results.

The Loughran and McDonald Sentiment Word Lists (Loughran and McDonald, 2011) and MPQA (Deng and Wiebe, 2015) are uncertainty lexicons that we leverage for the evaluation of this aspect. A simple average of uncertain words over the number of words of the input text is used in our model.
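A minimal sketch of this ratio is given below; uncertainty_terms is assumed to be a set of lower-cased words loaded beforehand from the two lexicons, and the tiny set in the usage comment is only a placeholder.
```python
# Sketch: fraction of uncertainty-lexicon words in the input text.
def uncertainty_ratio(text, uncertainty_terms):
    # uncertainty_terms: set of lower-cased words taken from the
    # Loughran-McDonald and MPQA uncertainty lexicons (loaded beforehand).
    words = [w.lower().strip(".,;:!?\"'") for w in text.split()]
    if not words:
        return 0.0
    return sum(w in uncertainty_terms for w in words) / len(words)

# Example with a tiny placeholder lexicon:
# uncertainty_ratio("It could possibly be true", {"could", "possibly", "may", "might"})
```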
Affect (Pang and Lee, 2008) is an extensive review of the literature on sentiment analysis and opinion mining that encompasses the field of linguistic aspect evaluation, on which this work is focused. (Whissell, 2004) provides the Dictionary of Affect in Language, which includes people's mean ratings for the Pleasantness, Activation, and Imagery of close to 9,000 words. The dictionary is a lexicon with ratings representing the two main dimensions of emotional space, valence and arousal, along with another rating for people's assessment of imageability, i.e., how easy it is to form a mental picture of a word.
A better definition of affect in the context of deception detection is necessary in order to decide which resource is more appropriate for the aspect evaluation; for now, we let the experimental evaluations indicate the most appropriate way of measuring affect for our task.
(Li and Nenkova, 2015) mention that the MRC Psycholinguistic Database (Wilson, 1988) has words annotated with respect to imageability, among other aspects, while VADER (Valence Aware Dictionary and sEntiment Reasoner) (Hutto and Gilbert, 2014) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to social media, which makes it quite appropriate for our case.
For this aspect we make use of two different sentiment classifiers: VADER and Pattern/TextBlob, already mentioned in the Subjectivity section. From each of the two classifiers we obtain three metrics: the sum of all positive scores, the sum of all negative scores and the total sum of scores, averaged respectively by the number of words with a positive score, the number of words with a negative score and the total number of words in the input text. By using these metrics we ensure that statistics such as the variance and range of emotion within the text are passed to the LUX classifier.
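A minimal sketch of the three VADER-based metrics is shown below, using VADER's word-level lexicon as a stand-in for the exact scoring used in LUX; the analogous TextBlob metrics follow the same pattern.
```python
# Sketch of the three VADER-based affect metrics described above: averaged
# positive score, averaged negative score, and overall average score per word.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

def vader_affect_metrics(text):
    analyzer = SentimentIntensityAnalyzer()
    words = [w.lower() for w in text.split()]
    scores = [analyzer.lexicon.get(w, 0.0) for w in words]   # per-word valence

    pos = [s for s in scores if s > 0]
    neg = [s for s in scores if s < 0]
    avg_pos = sum(pos) / len(pos) if pos else 0.0
    avg_neg = sum(neg) / len(neg) if neg else 0.0
    avg_all = sum(scores) / len(scores) if scores else 0.0
    return avg_pos, avg_neg, avg_all
```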
Verbal Immediacy (Mehrabian and Wiener, 1966) first defined immediacy as a linguistic property that refers to the degree to which a source associates himself/herself with the topics of a message; that is, "immediacy is the degree to which a source approaches or avoids a topic". Based on that definition, (Zhou et al., 2004) measured it by analysing spatial and temporal terms, passive voice ratio, self-reference manner and group-reference manner, among others. Different works relate non-immediacy to the presence of deception in text, since deceivers try to dissociate themselves from their communication.
Negative affect and passive voice are indicators of non-immediacy. Since the first is already addressed by us, we use the ratio of passive sentences over the total number of sentences to determine how passive the text is. In this context, a sentence is deemed passive if it contains a "BE" verb followed by some other, non-BE verb, except for a gerund.
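A minimal sketch of this heuristic, using NLTK for tokenization and POS tagging (the original implementation may differ in its tokenization and tag handling):
```python
# Sketch of the passive-sentence heuristic described above: a sentence counts as
# passive if a form of "to be" is later followed by another (non-"be") verb that
# is not a gerund. Requires nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import nltk

BE_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being"}

def passive_ratio(text):
    sentences = nltk.sent_tokenize(text)
    if not sentences:
        return 0.0
    passive = 0
    for sent in sentences:
        tagged = nltk.pos_tag(nltk.word_tokenize(sent))
        seen_be = False
        for word, tag in tagged:
            if word.lower() in BE_FORMS:
                seen_be = True
            elif seen_be and tag.startswith("VB") and tag != "VBG":
                passive += 1
                break
    return passive / len(sentences)
```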
Diversity / Quantity / Pausality These are syntactic features, and some of the previously defined aspects already make use of one or more ways of measuring them. For example, a diversity measurement is used to evaluate a sentence's complexity. Still, there are many different ways to measure diversity, and since we intend to remove the redundancy of the features anyway, we measure it with many different formulas.
In a 2013 article, (Jarvis, 2013) proposed that lexical diversity should be measured through six properties: variability, volume, evenness, rarity, dispersion and disparity. Using a python library (https://github.com/kristopherkyle/lexical_diversity), we measure some of those metrics, namely different types of type-token ratio (TTR), vocd (McCarthy and Jarvis, 2007) and the measure of textual lexical diversity (MTLD) (McCarthy, 2005).
Other simple aspects are also taken into account, such as the overall quantity of words, in absolute number and per POS tag, as well as pausality, measured by the ratio between punctuation marks and number of sentences; a sketch of some of these simpler metrics follows.
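The following sketch computes a basic type-token ratio, word count and pausality with naive regular-expression tokenization; in LUX the richer diversity metrics (vocd, MTLD) come from the lexical_diversity library mentioned above.
```python
# Sketch of a few of the simpler diversity/quantity/pausality metrics:
# type-token ratio, word count, and punctuation marks per sentence.
import re

def simple_diversity_features(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = re.findall(r"[.,;:!?]", text)

    ttr = len(set(words)) / len(words) if words else 0.0        # lexical diversity
    pausality = len(punctuation) / max(len(sentences), 1)       # punctuation per sentence
    return {"ttr": ttr, "word_count": len(words), "pausality": pausality}
```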

Evaluation and Results
In simple terms, LUX is a binary model for classifying general text into fake news / real news, and it was originally proposed as a way to evaluate the efficiency of the above-mentioned linguistic features. Aiming for generality, this model takes a text document (which could be a long article or a simple headline) as its sole input and outputs the probabilities of it being fake or not, based on its psycho-linguistic profile and contextual representation. For the latter, different types of text encodings were tested, and it became clear that the usage of fixed-size BERT document embeddings outperformed Word2Vec, which was tested with RNN, LSTM (Hochreiter and Schmidhuber, 1997) and Bi-LSTM (Schuster and Paliwal, 1997) architectures, with the latter having the best results, but still inferior to BERT.
After performing a grid search over different optimizers, activation functions, learning rates, training epochs and fully connected layer (FCL) dimensions, the initial model was set to be composed of a simple ReLU-activated 64-dimensional FCL with a dropout of 30%, attached to a final layer of dimensionality 2, where a softmax represents the false and true label probabilities. Adam (Kingma and Ba, 2014) was the best performing optimizer, and a learning rate of α = 0.001 over 100 training epochs generally yielded the best results. Figure 1 outlines the model. The results in Table 2 for accuracy and F1 score come from 9-fold training over the data. The results for the two best baseline models are also included, namely the same model using only the BERT document embeddings, and the W2V embeddings over a simple Bi-LSTM with 128 dimensions in the recurrent layer.
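A minimal sketch of this classification head is given below, assuming Keras, a 1024-dimensional BERT document embedding concatenated with the 97 linguistic features, and a standard cross-entropy loss; these last details are assumptions not specified in the text.
```python
# Sketch of the grid-searched classification head: 64-dim ReLU FCL, 30% dropout,
# 2-way softmax output, Adam with learning rate 0.001 for 100 epochs.
# The input dimensionality (BERT-Large embedding + 97 linguistic features)
# and the loss function are assumptions.
import tensorflow as tf

def build_lux_head(input_dim=1024 + 97):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dropout(0.3),                    # 30% dropout
        tf.keras.layers.Dense(2, activation="softmax"),  # P(false), P(true)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_lux_head()
# model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))
```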
Since the data from FEVER18 (Thorne et al., 2018) and Snopes19 (Hanselowski et al., 2019) is composed of short statements, a comparative analysis is also presented alongside a V4+EM+T2 run using only the claims as input text, instead of the longer body texts.
The final input for each article is an ensemble of a document embedding generated by the pre-trained BERT-Large uncased model and the 97 linguistic features described in the previous section. A version of the code repository is available at https://github.com/lucas0/Lux. Given the initial results, the robustness added by a different source, i.e., Emergent, combined with the benefit of balancing classes using the trusted news (T2), yielded the best results. Consequently, this subset was selected for the linguistic feature ablation analysis. Table 3 presents the three most impactful positive and negative features, i.e., features that, when removed, most decrease or most increase the accuracy of the model, respectively. These are all results using as base the best model run, i.e., the LUX model over the V4+EM+T2 data, depicted in Table 2. A longer table containing the results for the full ablation analysis can be found within the supplementary material (appendix 2).

Ablation
Positive Features (PF): When individually removed, 50 of the 97 features of the model decrease its accuracy, by an average of 0.056%; 21 of those 'contribute' more than the average of all the positive features, and only 10 features decrease accuracy by more than 1% when absent. All three top PF fall into the Quantity group, as POS-tag counts, while most of the more sophisticated, i.e., higher semantic level, features make it into the top 10. Besides the ones featured (pun intended) in the table, the top 10 also comprises, unordered: pausality, the Coleman-Liau informality score, specificity, the measure of textual lexical diversity (MTLD), and three features from the semantic complexity evaluator (Venant and d'Aquin, 2019): assortativity, average number of in-links, and density. In short, those features are metrics of a graph generated from the entities identified in the text, when matched against the DBpedia knowledge graph. They refer to, respectively: the similarity of connections with respect to the number of edges a vertex has to other vertices; the number of links that go from entities of the global DBpedia graph to the identified entities; and the density of the graph, which stresses how much its nodes are connected to each other.
Negative Features (NF): As expected, the negative features account for the other 47 features. On average, each negative feature increases the accuracy of the model by 0.6% when removed individually. Of those, 17 have a better-than-average impact. Avoiding the risk of removing important features from the model, and given the high number of negative features, we mention the 9 features that, when not considered, improved LUX's accuracy by more than 1%, but focus the discussion on the top 3. Our results point to the number of VBD tokens (verbs in the past tense form) in the input text as being the third least important feature of the model, while the top two NF are metrics from the same complexity evaluation approach mentioned above. They are nbTypesStd and diameter of nodes, meaning, respectively: the standard deviation of the number of different link types per node, and the "spreadness" (sic.) of concepts, i.e., the more unrelated and specific concepts there are, the higher the diameter will be. The other six NF that improved the accuracy of the model by more than 1% when removed are: the number of words POS-tagged as PDT (predeterminer), two readability metrics (Dale-Chall and Flesch Reading Ease) and three other features from pySemCom: number of entities, entity density in the text, and the standard deviation of the number of in-links.

Conclusion and Future Work
This work has made the following two significant contributions: i) the consolidation of the VERITAS Dataset, which is unique due to its provision of organic origins for each claim in the collection, each of which has, in turn, been manually verified by FCAs. Given the completeness of the released data, it can be a useful resource for a number of related tasks, namely: document retrieval, stance detection and claim validation. ii) As a second contribution, we have confirmed the hypothesis that the inclusion of linguistic metrics as model features allows for better text classification performance, at least in the target task of identifying fake news. After having set up an initial version of the classifier, named LUX, we demonstrated an improvement over its first evaluation by increasing the quality and quantity of the training data, as well as by removing the most negative features from the model. The final LUX version performs better than both tested baselines. When used to evaluate the quality of datasets, LUX yields better scores when trained with VERITAS than when trained with two other fake-news datasets, FEVER18 and Snopes19.
Future work involves the development of an automatic origin identification step for the VERITAS dataset, which would allow for a much larger version of it, which in turn could further enhance the classification model (LUX). If this step is achieved, a bootstrapping loop for claim veracity checking with origin identification would be complete, and both the inclusion of new entries in the data collection and the further training of the classification model could be fully automated, having as their only bottleneck the continuous scraping of manually fact-checked claims, which is already an automatic process.
Another enhancement being added to this work is the output and analysis of BERT attention weights (Vaswani et al., 2017) for both explainability and interpretability of the model (Yin et al., 2016; Rush et al., 2015). Increasing the size of the VERITAS dataset could also be achieved by leveraging the work done by (Hanselowski et al., 2019) and identifying, as the origins of a claim, the websites containing the snippets annotated as 'supportive' of the claim. This task is currently ongoing.

Appendix 1: VERITAS Dataset Fields

Checked Claim The main affirmation being verified in the article.
Claim Label The verdict, which, along with the source document of a fact candidate, composes the input/outcome pairs of the dataset to be used in our classification model. In other applications or tasks it might not even be necessary.
We assign gold-standard status to this annotation, given that each of those checked documents was manually investigated by one or more fact-checking journalists before a verdict regarding its veracity was reached; the labels are thus as trustworthy as the journalists and corresponding fact-checking agencies themselves.
Different FCAs use different labels, e.g., 'mostly-true', 'mixture', 'unproven', etc.; consequently, there is a need for normalization or removal of the ones that cannot be directly mapped to either "true" or "false".
Tags The set of tags used by the journalist that wrote the fact-checking article. These are mainly used for navigation within the website but could be used for clustering of the dataset and retrieval of other claims regarding the same topic.
FCA Date The date the claim was checked by one of the fact-checking agencies.
Origin URL The URL of the web document that originated the claim, i.e. its origin. Here, origin is defined as a source that directly supports the claim.
Note that an origin does not have to be the very first article that stated the claim and that there could be multiple origins for a single claim.
Origin Domain The origin URL domain. This can have a great impact on the accuracy of a neural network classifier, or on the weighting of a simpler classification method. Examples of using the URL domain as a feature for the veracity of its content are not new (Nakashole and Mitchell, 2014; Balakrishnan and Kambhampati, 2011).

Origin Body The whole text extracted from the origin URL. The method used to obtain the Origin Body is the main difference across the versions, as discussed above.
Origin Title The title of the origin page. This is another possibly useful feature for related tasks, or an extra feature for our classifier (Popat et al., 2017). Since the title and the checked claim have similar lengths, using this attribute instead of the whole origin text would probably have yielded better results in the stance classification ranking.
Origin Summary Besides being faster than the previously used crawling methods, the current version of the crawler (VeritasCrawler) also generates a summary of the origin. This could be a valuable piece of information, but it would demand checking whether it is a valid depiction of the content of the origin.
Origin Keywords Similar to the Tags of the fact-checking article, with the difference that these are obtained with the article curation library newspaper3k. This could also be used as a feature for the Origin Identification Classifier (see the Future Work section).
Origin Date The date at which the origin article was published.