X-Fact: A New Benchmark Dataset for Multilingual Fact Checking

In this work, we introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims. The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers. The dataset includes a multilingual evaluation benchmark that measures both out-of-domain generalization and zero-shot capabilities of multilingual models. Using state-of-the-art multilingual transformer-based models, we develop several automated fact-checking models that, along with textual claims, make use of additional metadata and evidence from news stories retrieved using a search engine. Empirically, our best model attains an F-score of around 40%, suggesting that our dataset is a challenging benchmark for the evaluation of multilingual fact-checking models.


Introduction
Curbing the spread of fake news and misinformation on the web has become an important societal challenge. Several fact-checking initiatives, such as PolitiFact, expend a significant amount of manual labor to investigate and determine the truthfulness of viral statements made by public figures, organizations, and social media users. Since this process is time-consuming, a large number of falsified statements inevitably go unchecked.
With the aim of assisting fact-checkers, researchers in NLP have sought to develop computational approaches to fact-checking (Vlachos and Riedel, 2014; Wang, 2017; Pérez-Rosas et al., 2018). Many such works use the FEVER dataset, which contains claims extracted from Wikipedia documents (Thorne et al., 2018). Using real-world claims, Wang (2017) introduced the LIAR dataset, which consists of statements fact-checked by PolitiFact.

Although misinformation transcends countries and languages (Bradshaw and Howard, 2019; Islam et al., 2020), much of the recent work focuses on claims and statements made in English. Developing Automated Fact Checking (AFC) systems in other languages is much more challenging, the primary reason being the absence of a manually annotated benchmark dataset for those languages. Moreover, there are fewer fact-checkers in these languages, and as a result, a non-English monolingual dataset will inevitably be small and less effective for developing fact-checking systems. As recent research points out, a possible solution to such data scarcity is to train multilingual models (Aharoni et al., 2019; Wu and Dredze, 2019; Hu et al., 2020). This finding motivates us to construct a large multilingual resource that the research community can use to further the development of fact-checking systems in languages other than English.
Recent efforts in the construction of multilingual datasets are limited, both in scope and in size (Shahi and Nandini, 2020; Patwa et al., 2020). For instance, FakeCovid, a dataset introduced by Shahi and Nandini (2020), contains 3,066 non-English claims about COVID-19. In comparison, X-FACT contains 31,189 general-domain non-English claims from 25 languages. Moreover, FakeCovid contains only two labels, namely False and Others. We argue that this is undesirable, as fact-checking is a fine-grained classification task. Due to subtle differences in language, most claims are neither entirely true nor entirely false (Rashkin et al., 2017). In contrast, our dataset contains seven labels: we make distinctions between true, mostly true, half-true, etc. Table 1 shows two such examples from German and Brazilian Portuguese.
In summary, our contributions are:

1. We release a multilingual fact-checking benchmark, X-FACT, which includes 31,189 short statements labeled for factual correctness and covers 25 typologically diverse languages across 11 language families. X-FACT is an order of magnitude larger than any other multilingual dataset available for fact-checking.
2. Apart from the standard test set, we create two additional challenge sets to evaluate fact-checking systems' generalization abilities across different domains and languages.
3. We report results for several modeling approaches and find that these models underperform on all three test sets in our benchmark, suggesting the need for more sophisticated and robust modeling methods.
The X-FACT dataset and the code for our experiments can be obtained at https://github.com/utahnlp/x-fact.
The X-FACT Dataset

X-FACT is constructed from several fact-checking sources. We briefly outline this process here.
Sources of Claims. We relied on a list of non-partisan fact-checkers compiled by the International Fact-Checking Network (IFCN) and the Duke Reporter's Lab. We removed all websites that conduct fact-checks in English and are covered by previous work (Wang, 2017; Augenstein et al., 2019). As a starting point, we first queried Google's Fact Check Explorer (GFCE) for all the fact-checks done by a particular website. Then we crawled the linked article on the website along with additional metadata such as the claimant, URL, and date of the claim. For websites not linked through GFCE, we directly crawled all the available fact-checking articles from the fact-checker's website. We left out some fact-checkers because either the claims on their websites were not well specified or the fact-checker did not use any rating scale. We performed semi-automated text processing to remove duplicate claims and examples where the label appeared in the claim itself. This resulted in data from a total of 85 fact-checkers for further processing. Refer to the appendix for more details on this process.
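To make the semi-automated cleaning step concrete, the following is a minimal sketch of the two filters described above: removing duplicate claims and dropping examples whose claim text contains its own verdict. The normalization rules and the list of verdict words are illustrative assumptions, not the exact heuristics used in our pipeline.

```python
import re
import unicodedata

# Illustrative verdict words; the actual multilingual list would be larger.
LABEL_WORDS = {"true", "false", "mostly true", "mostly false",
               "partly true", "half true", "unverifiable"}

def normalize(claim: str) -> str:
    """Normalize a claim string for duplicate detection."""
    claim = unicodedata.normalize("NFKC", claim).lower()
    return re.sub(r"\s+", " ", claim).strip(" .\"'")

def leaks_label(claim: str) -> bool:
    """Check whether the claim text contains a verdict word."""
    text = normalize(claim)
    return any(word in text for word in LABEL_WORDS)

def clean(examples):
    """Drop duplicate and label-leaking examples; keep the rest."""
    seen, kept = set(), []
    for ex in examples:          # ex is a dict with a "claim" field
        key = normalize(ex["claim"])
        if key in seen or leaks_label(ex["claim"]):
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```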
Filtering the Dataset. There are two major challenges in using the crawled data directly: a) the labels are in different languages, and b) each fact-checker uses a different rating scale for categorization. To deal with these issues, we first manually translated all ratings to English, followed by a semi-automatic merging of labels that were found to be synonyms. Second, in consultation with Factly, an IFCN signatory, we created a rating scale compatible with most fact-checkers. Our label set contains five labels with decreasing levels of truthfulness: True, Mostly-True, Partly-True, Mostly-False, and False. To encompass cases where assigning a label is difficult due to lack of evidence or subjective interpretation, we introduced Unverifiable as another label. A final label, Other, denotes cases that do not fall under the above-specified categories. Following this process, we reviewed each fact-checker's rating system along with some examples and manually mapped their labels to our newly designed label scheme; see the appendix for the complete mapping.

We found that the data from several sources was dominated by a single label (> 80%). Since it is difficult to train machine learning models on highly imbalanced datasets, we removed 54 such websites. We additionally removed fact-checking websites that contained fewer than 60 examples. In total, our dataset contains 31,189 fact-checks.
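As a rough illustration of this source-level filtering, the sketch below keeps only fact-checkers whose most frequent label covers at most 80% of their examples and that contribute at least 60 examples; the field names are assumptions made for illustration.

```python
from collections import Counter

def keep_source(examples, max_majority=0.8, min_size=60):
    """Return True if one fact-checker's data is balanced and large enough.

    examples: list of dicts with a "label" field, all from one source.
    """
    if len(examples) < min_size:
        return False
    counts = Counter(ex["label"] for ex in examples)
    most_common = counts.most_common(1)[0][1]
    return most_common / len(examples) <= max_majority

def filter_sources(data_by_source):
    """data_by_source: dict mapping source name -> list of examples."""
    return {src: exs for src, exs in data_by_source.items()
            if keep_source(exs)}
```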
A Single Test Set is Not Sufficient. Recent advances in NLP have shown that multilingual models are effective for cross-lingual transfer (Kondratyuk and Straka, 2019; Wu and Dredze, 2019; Hu et al., 2020). A multilingual fact-checking system with similar transfer capabilities would certainly be an asset, especially for languages with few or no fact-checkers. From this perspective, we seek to provide a robust evaluation benchmark that can help us understand the generalization abilities of fact-checking systems.
With this objective, we construct three test sets, namely α1, α2, and α3. The first test set (α1) is distributionally similar to the training set; it contains fact-checks from the same languages and sources as the training set.
The second test set, the out-of-domain set (α2), contains claims from the same languages as the training set but from different sources. A model that performs well on both α1 and α2 can be presumed to generalize across source distributions.
The third test set is the zero-shot set (α3), which measures the cross-lingual transfer abilities of fact-checking systems. The α3 set contains claims from languages not present in the training set. Models that overfit language-specific artifacts will underperform on α3. We split the data into training (75%), development (10%), and the α1 test set (15%). This leaves us with 13 languages for our zero-shot test set (α3). The remaining sources form our out-of-domain test set (α2). See table 2 for the number of claims and languages in each of these splits.
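The following is a minimal sketch of how such a three-way benchmark can be constructed, assuming each example carries language and source fields; the exact language and source selection used for X-FACT is described above and in the appendix.

```python
import random

def build_splits(examples, train_langs, train_sources, seed=0):
    """Partition examples into train/dev/alpha1/alpha2/alpha3 splits.

    examples: list of dicts with "language" and "source" fields.
    train_langs / train_sources: languages and sources seen in training.
    """
    rng = random.Random(seed)
    in_domain, alpha2, alpha3 = [], [], []
    for ex in examples:
        if ex["language"] not in train_langs:
            alpha3.append(ex)            # zero-shot: unseen language
        elif ex["source"] not in train_sources:
            alpha2.append(ex)            # out-of-domain: unseen source
        else:
            in_domain.append(ex)
    rng.shuffle(in_domain)
    n = len(in_domain)
    train = in_domain[: int(0.75 * n)]
    dev = in_domain[int(0.75 * n): int(0.85 * n)]
    alpha1 = in_domain[int(0.85 * n):]   # in-domain test set
    return train, dev, alpha1, alpha2, alpha3
```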
In total, X-FACT covers the following 25 languages (shown with their ISO 639-1 code for brevity): ar, az, bn, de, es, fa, fr, gu, hi, id, it, ka, mr, no, nl, pa, pl, pt, ro, ru, si, sr, sq, ta, tr. Please refer to the appendix for more details.

Experimental Setting
The goal of our experiments is to study how different modeling choices address the task of multilingual fact-checking. All our experiments use mBERT, the multilingual variant of BERT (Devlin et al., 2019), and use the macro F1 score as the evaluation metric. We report average F1 scores and standard deviations over four runs with different random seeds.
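For reference, a minimal sketch of this evaluation protocol, using scikit-learn's macro F1 over multiple seeded runs:

```python
import numpy as np
from sklearn.metrics import f1_score

def evaluate_runs(runs):
    """runs: list of (gold_labels, predicted_labels) pairs, one per seed.

    Returns the mean and standard deviation of the macro F1 score.
    """
    scores = [f1_score(gold, pred, average="macro") for gold, pred in runs]
    return np.mean(scores), np.std(scores)
```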
We implement the following multilingual models as baselines for future work (a sketch of models 2 and 3 appears after this list):

1. Claim-Only Model (Claim-Only): We provide the textual claim as the only input to the model, in effect treating the problem as a simple sentence classification problem.

2. Attention-based Evidence Aggregator (Attn-EA): We augment the claim with evidence documents retrieved using a search engine. For a given claim and a collection of n evidence documents, we first encode the claim and evidences separately using mBERT by extracting the output of the CLS token, denoted as c and [e_1, e_2, ..., e_n]. We apply dot-product attention (Luong et al., 2015) to obtain the attention weights [α_1, α_2, ..., α_n], and then compute a linear combination using these attention coefficients: e = Σ_i α_i e_i. This representation is then concatenated with c and fed to the classification layer. In all our experiments, we fix the number of evidence documents to five.
3. Augmenting Metadata (+Meta): We concatenate additional key-value metadata with the claim text by representing it as a sequence of the form Key : Value (Chen et al., 2019). This metadata includes the language, website-name, claimant, claim-date, and review-date. If a field is not available for a claim, we represent its value as none.
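Below is a minimal PyTorch sketch of models 2 and 3, assuming mBERT is loaded via the transformers library. Tensor shapes, field names, and the classifier head are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

def linearize(claim, metadata):
    """Serialize key-value metadata and prepend it to the claim (+Meta).

    metadata: dict such as {"language": "de", "claimant": "...", ...};
    missing fields are represented by "none".
    """
    fields = ["language", "website-name", "claimant", "claim-date", "review-date"]
    prefix = " ".join(f"{k} : {metadata.get(k, 'none')}" for k in fields)
    return f"{prefix} {claim}"

class AttnEA(nn.Module):
    """Attention-based aggregation of evidence CLS embeddings (Attn-EA)."""

    def __init__(self, num_labels=7, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def encode_cls(self, **batch):
        # CLS-token output for each encoded sequence in the batch.
        return self.encoder(**batch).last_hidden_state[:, 0]

    def forward(self, claim_inputs, evidence_inputs):
        c = self.encode_cls(**claim_inputs)          # (1, hidden)
        e = self.encode_cls(**evidence_inputs)       # (n, hidden)
        # Dot-product attention of the claim over the evidence snippets.
        scores = e @ c.squeeze(0)                    # (n,)
        alpha = torch.softmax(scores, dim=0)         # attention weights
        pooled = (alpha.unsqueeze(1) * e).sum(0, keepdim=True)  # (1, hidden)
        return self.classifier(torch.cat([c, pooled], dim=1))   # (1, labels)
```

In use, one would tokenize the (optionally metadata-augmented) claim and the five retrieved snippets with the matching AutoTokenizer and pass the encodings to AttnEA; the Claim-Only model corresponds to feeding c directly to a linear classifier.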
All models are trained in a multilingual setting, i.e., a single model is trained for all languages. We do not report monolingual models, as training them was unstable due to the small amount of data available for each language.

Results
The results are shown in table 3. We discuss them by answering a series of research questions.
As an indicator of the label distribution, we include a majority baseline that always predicts the most frequent label (i.e., False).
Does the dataset exhibit claim-only bias? Before moving to more sophisticated systems, let us first examine whether a model can predict a statement's veracity using only the textual claim. Note that this setting is similar to that of hypothesis-only models for the task of Natural Language Inference (NLI) (Poliak et al., 2018). From table 3, we see that a claim-only model outperforms the majority baseline by a large margin. We can draw two inferences: a) a significant number of examples in α1 can be labeled by relying on the textual claim alone, and b) the claim-only model has learned spurious correlations from the dataset.
Do search snippets improve fact-checking? First, results from table 3 show that augmenting models with metadata is helpful. Second, using search snippets as evidence with an attention-based model along with metadata improves performance by 2.5 percentage points on the in-domain test set (α1). To further validate that the snippets indeed help the evidence-based model, we perform another experiment in which we pair each claim with random search snippets of the same language (see the sketch below). Since there is no relevant evidence, the performance is indeed similar to the claim-only model, which again confirms our finding that the dataset exhibits some claim-only bias. While the Attn-EA model provides some performance improvement on the in-domain test set, surprisingly, the claim-only model outperforms the evidence-based model by a small margin on α3. This might be due to the evidence-based model overfitting the in-domain data.
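A minimal sketch of this control experiment, assuming each example stores its language and retrieved snippets; shuffling evidence within each language breaks the claim-evidence link while preserving the language distribution.

```python
import random

def randomize_evidence(examples, seed=0):
    """Replace each claim's snippets with those of a random same-language claim."""
    rng = random.Random(seed)
    by_lang = {}
    for ex in examples:
        by_lang.setdefault(ex["language"], []).append(ex["snippets"])
    shuffled = {lang: rng.sample(snips, len(snips))
                for lang, snips in by_lang.items()}
    counters = {lang: 0 for lang in by_lang}
    for ex in examples:
        lang = ex["language"]
        ex["snippets"] = shuffled[lang][counters[lang]]
        counters[lang] += 1
    return examples
```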
How informative are the search snippets? Recall that we used snippets to summarize the retrieved search results. To gauge the relevance of these snippets, we manually examined 100 examples from the α1 test set for Hindi. Our preliminary analysis reveals that only 45% of the snippets provide sufficient information to classify the claim, indicating why the performance increase with the evidence-based model is small. The same analysis suggests that for 83% of the examples, the full text of the retrieved web pages provides sufficient evidence to determine the veracity of the claim. This means that if models were able to ingest large documents (full web pages), the performance gains could be substantially larger.
Do the models generalize across sources and languages? We observe that performance on α2 and α3 is worse than on α1, not only highlighting the difficulty of these challenge sets, but also showing that models overfit both source-specific patterns (α2) and language-specific patterns (α3).
Importantly, these results underscore the utility of our challenge sets in assessing model generalizability as well as diagnosing overfitting.

Conclusion
We presented X-FACT, currently the largest multilingual dataset for fact-checking. Compared to prior work, X-FACT is an order of magnitude larger, enabling the exploration of large transformer-based multilingual approaches to fact-checking. We presented results for several multilingual modeling methods and showed that the models find this new dataset challenging. We envision our dataset as an important benchmark for the development and evaluation of multilingual approaches to fact-checking.

Omitted Fact-Checking Websites

1. As mentioned in the paper, we omit several fact-checking websites from our data. A large number of these websites are not amenable to crawling and scraping. For instance, AFP is a prominent fact-checker for many Romance languages, but the template of its website does not lend itself to automatic data extraction tools. We can try to access these websites through GFCE, but in many cases the assigned ratings are sentences instead of a single label.
2. Another common reason is that on a number of these websites, the claim statements are not well specified. Take, for example, Faktograf, a website performing fact-checking in Croatian. On this website, we could neither properly extract the claim statements nor determine the rating assigned to the articles.
3. For a small percentage of the claim statements, Google search did not yield any results. Since these constitute only a very small fraction of the data, we omitted them from our training, development, and test sets for all models.
For these reasons, a large number of websites in several languages could not be crawled.
There are two ways we obtain claims, labels, and other metadata: through Google's Fact Check Explorer (GFCE), and by crawling the respective fact-checking websites. When links are available on GFCE, we visit the linked website to download additional metadata. We will also release the label mapping we created along with the dataset. Appendix A provides more details on the dataset we collected.
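As an illustration of the first route, fact-checks indexed by Google can also be retrieved programmatically via the Fact Check Tools API, which backs the Fact Check Explorer. The sketch below (requiring an API key) is an approximation of this process under those assumptions, not the exact tooling used for our crawl.

```python
import requests

API = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def fetch_fact_checks(site, api_key, page_size=100):
    """Fetch claims reviewed by one fact-checking site via the
    Google Fact Check Tools API (requires an API key)."""
    params = {"reviewPublisherSiteFilter": site,
              "pageSize": page_size, "key": api_key}
    claims, token = [], None
    while True:
        if token:
            params["pageToken"] = token
        resp = requests.get(API, params=params)
        resp.raise_for_status()
        data = resp.json()
        for c in data.get("claims", []):
            review = c["claimReview"][0]
            claims.append({"claim": c.get("text"),
                           "claimant": c.get("claimant"),
                           "claim_date": c.get("claimDate"),
                           "url": review.get("url"),
                           "rating": review.get("textualRating")})
        token = data.get("nextPageToken")
        if not token:
            return claims
```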

Dataset        Model            Run Time
X-FACT         Claim            1.5 hr
X-FACT         Claim+Meta       1.5 hr
X-FACT         Attn-EA          2.3 hr
X-FACT         Attn-EA + Meta   2.3 hr
X-FACT + Eng   Claim+Meta       2.5 hr
X-FACT + Eng   Attn-EA + Meta   4.1 hr

Table 5: Average training run times.

B.1 Models and Code
As described in the main paper, we used multilingual BERT for performing our experiments. We implemented all our models in PyTorch using the transformers library (Wolf et al., 2019).

B.2 Computing Infrastructure Used
All of our experiments required access to GPU accelerators. We ran our experiments on three machines: Nvidia Tesla V100 (16 GB VRAM), Nvidia Tesla P100 (16 GB VRAM), and Nvidia A100 (40 GB VRAM). Our experiments for the claim-only models were run on the V100 and P100 GPUs; the evidence-based models required more VRAM, so they were run on the A100 GPUs.

B.3 Hyperparameters and Fine-tuning Details
1. We used the mBERT-base model for all of our experiments. This model has 12 layers, each with a hidden size of 768 and 12 attention heads. The total number of parameters in this model is 125 million. We set all hyper-parameters as suggested by Devlin et al. (2019), except the batch size, which we fixed to 8.
2. All our models were run with four random seeds (seed ∈ {1, 2, 3, 4}), and the numbers reported in the paper are the means of these four runs. We fine-tuned all models for ten epochs, and the checkpoint performing best on the development set across all epochs was chosen as the final model (a minimal sketch of this loop follows the list).
3. Due to constraints on the VRAM of the GPUs, we restricted the number of evidence documents to five.
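A minimal sketch of this fine-tuning protocol, assuming hypothetical train_one_epoch and evaluate helpers; it keeps the checkpoint with the best development macro F1 for each seed.

```python
import copy

def finetune(model_fn, train_one_epoch, evaluate, seeds=(1, 2, 3, 4), epochs=10):
    """Fine-tune with multiple seeds and keep per-seed best-dev checkpoints.

    model_fn(seed) builds a freshly initialized model; train_one_epoch and
    evaluate are hypothetical helpers (evaluate returns dev macro F1).
    """
    results = []
    for seed in seeds:
        model = model_fn(seed)
        best_f1, best_state = -1.0, None
        for _ in range(epochs):
            train_one_epoch(model)
            f1 = evaluate(model)
            if f1 > best_f1:                      # model selection on dev
                best_f1 = f1
                best_state = copy.deepcopy(model.state_dict())
        model.load_state_dict(best_state)
        results.append((best_f1, model))
    return results   # report mean and std of test scores over the seeds
```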
Average Run Times. Average training times are presented in table 5.