Zero-shot Fact Verification by Claim Generation

Neural models for automated fact verification have achieved promising results thanks to the availability of large, human-annotated datasets. However, for each new domain that requires fact verification, creating a dataset by manually writing claims and linking them to their supporting evidence is expensive. We develop QACG, a framework for training a robust fact verification model with automatically generated claims that are supported, refuted, or unverifiable given evidence from Wikipedia. QACG generates question-answer pairs from the evidence and then converts them into different types of claims. Experiments on the FEVER dataset show that our QACG framework significantly reduces the demand for human-annotated training data. In a zero-shot scenario, QACG improves a RoBERTa model's F1 from 50% to 77%, equivalent in performance to 2K+ manually curated examples. Our QACG code is publicly available.


Introduction
Fact verification aims to validate a claim in the context of evidence. This task has attracted growing interest with the rise in disinformation in news and social media. Rapid progress has been made by training large neural models (Liu et al., 2020b; Zhong et al., 2020) on the FEVER dataset (Thorne et al., 2018), which contains more than 100K human-crafted (evidence, claim) pairs based on Wikipedia.
Fact verification is demanded in many domains, including news articles, social media, and scientific documents. However, it is not realistic to assume that large-scale training data is available for every new domain that requires fact verification. Creating training data by asking humans to write claims and search for evidence to support/refute them can be extremely costly.
We address this problem by exploring the possibility of automatically generating large-scale (evidence, claim) pairs to train the fact verification model. We propose a simple yet general framework Question Answering for Claim Generation (QACG) to generate three types of claims from any given evidence: 1) claims that are supported by the evidence, 2) claims that are refuted by the evidence, and 3) claims that the evidence does Not have Enough Information (NEI) to verify.
To generate claims, we utilize Question Generation (QG) (Zhao et al., 2018; Liu et al., 2020a; Pan et al., 2020), which aims to automatically ask questions from textual inputs. QG has been shown to benefit various NLP tasks, such as enriching QA corpora (Alberti et al., 2019), checking factual consistency for summarization, and data augmentation for semantic parsing (Guo et al., 2018). To the best of our knowledge, we are the first to employ QG for fact verification.
As illustrated in Figure 1, given a passage P as the evidence, we first employ a Question Generator to generate a question-answer pair (Q, A) for the evidence. We then convert (Q, A) into a claim C (QA-to-Claim) based on the following logical assumptions: a) if P can answer Q and A is the correct answer, then C is a supported claim; b) if P can answer Q but A is an incorrect answer, then C is a refuted claim; c) if P cannot answer Q, then C is a NEI claim. The Question Generator and the QA-to-Claim model are off-the-shelf BART models, finetuned on the SQuAD (Rajpurkar et al., 2016) and QA2D (Demszky et al., 2018) datasets.
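The three-way case analysis above can be sketched as a small labeling function. This is a schematic illustration, not the paper's implementation: `answerable` and `answer_correct` stand in for the QA checks, which QACG realizes implicitly through its three generation strategies rather than by explicit question answering.

```python
def claim_label(answerable: bool, answer_correct: bool) -> str:
    """Map the two QA checks onto a FEVER-style claim label.

    answerable:     can the evidence P answer the question Q?
    answer_correct: is A the correct answer to Q given P?
    """
    if not answerable:
        return "NEI"  # the evidence cannot verify the claim
    return "SUPPORTED" if answer_correct else "REFUTED"
```

Each of the three claim-generation strategies described later instantiates exactly one branch of this function.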
We generate 100K (evidence, claim) pairs for each type of claim, which we then use to train a RoBERTa model for fact verification. We evaluate the model on three test sets based on the FEVER dataset. Although we do not use any human-labeled training examples, the model achieves over 70% of the F1 performance of a fully-supervised setting. By finetuning the model with only 100 labeled examples, we further close the performance gap, achieving 89.1% of fully-supervised performance. These results show that pretraining the fact verification model with generated claims greatly reduces the demand for in-domain human annotation. When evaluating the model on an unbiased test set for FEVER, we find that training with generated claims also produces a more robust fact verification model.

Figure 1: Overview of our QACG framework, consisting of two modules: 1) the Question Generator generates questions from the evidence P and the extra contexts P_ext given different answers extracted from the passage (in green), and 2) QA-to-Claim converts question-answer pairs into claims with different labels.
In summary, our contributions are:
• To the best of our knowledge, this is the first work to investigate zero-shot fact verification.
• We propose QACG, a novel framework to generate high-quality claims via question generation.
• We show that the generated training data can greatly benefit the fact verification system in both zero-shot and few-shot learning settings.

Methodology
Given a claim C and a piece of evidence P as inputs, a fact verification model F predicts a label Y ∈ {supported, refuted, NEI} indicating whether C is supported, refuted, or cannot be verified by the information in P.
For the zero-shot setting, we assume no human-annotated training example is available. Instead, we generate a synthetic training set based on our QACG framework to train the model.

Question Generator and QA-to-Claim
As illustrated in Figure 1, our claim generation model QACG has two major components: a Question Generator G, and a QA-to-Claim model M.
The Question Generator takes as input an evidence P and a text span A from the given evidence and aims to generate a question Q with A as the answer. We implement this with the BART model, a large transformer-based sequence-to-sequence model pretrained on 160GB of text. The model is finetuned on the SQuAD dataset processed by Zhou et al. (2017), where the model encodes the concatenation of the SQuAD passage and the answer text and then learns to decode the question. We evaluate the question generator using automatic and human evaluation and investigate its impact on fact verification in Appendix A.
The QA-to-Claim Model takes as inputs Q and A, and outputs the declarative sentence C for the (Q, A) pair, as shown in Figure 1. We also treat this as a sequence-to-sequence problem and finetune the BART model on the QA2D dataset (Demszky et al., 2018), which contains a human-annotated declarative sentence for each (Q, A) pair in SQuAD.

Claim Generation
Given the pretrained question generator G and the QA-to-Claim model M, we then formally introduce how we generate claims with different labels.
Supported claim generation. Given an evidence P, we use named entity recognition to identify all entities within P, denoted as E. We treat each entity a ∈ E in turn as the answer and generate a question q = G(P, a) with the question generator. The question-answer pair (q, a) is then fed to the QA-to-Claim model to generate the supported claim c = M(q, a).
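The supported-claim loop can be sketched as follows. This is a minimal orchestration sketch: the NER function and the models G and M are toy stand-ins (in the paper they are an NER tagger and two finetuned BART models), and the example strings are hypothetical.

```python
def generate_supported_claims(passage, ner, G, M):
    """One supported claim per entity: entity -> question -> claim."""
    claims = []
    for a in ner(passage):            # each entity in P is a candidate answer
        q = G(passage, a)             # question whose answer is a
        claims.append((q, a, M(q, a)))  # convert (q, a) into a claim
    return claims

# Toy stand-ins for the demo; the real G and M are finetuned BART models.
ner = lambda p: ["James Cameron"]
G = lambda p, a: "Who is the producer of Avatar?"
M = lambda q, a: "James Cameron is the producer of Avatar."

claims = generate_supported_claims(
    "Avatar was produced by James Cameron.", ner, G, M)
```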
Refuted claim generation. To generate a refuted claim, after generating the question-answer pair (q, a), we use answer replacement (shown in Figure 1) to replace the answer a with another entity a′ of the same type, such that a′ is an incorrect answer to the question q. Using a as the query, we randomly sample a phrase from the top-5 most similar phrases in pretrained Sense2Vec (Trask et al., 2015) as the replacement answer a′. The new pair (q, a′) is then fed to the QA-to-Claim model to generate the refuted claim.
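The answer-replacement step can be sketched as below. Sense2Vec's similar-phrase lookup is stubbed with a fixed candidate list, and the overlap filter is our own minimal formulation of the "low lexical overlap" rule described next, not the paper's exact rule.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two phrases."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def replace_answer(answer, candidates, max_overlap=0.0):
    """Return the first candidate phrase sharing (almost) no tokens
    with the original answer; None if every candidate overlaps."""
    for cand in candidates:  # candidates: e.g. top-5 Sense2Vec neighbors
        if token_overlap(answer, cand) <= max_overlap:
            return cand
    return None

# Hypothetical candidate list for illustration:
cands = ["James Francis Cameron", "Steven Spielberg", "Peter Jackson"]
new_answer = replace_answer("James Cameron", cands)  # skips the overlapping one
```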
To avoid the case where a′ is still a correct answer, we define rules to ensure that a′ has little lexical overlap with a. However, this problem is non-trivial and cannot be completely avoided. For example, for the QA pair ("Who is the producer of Avatar?"; "James Cameron"), another valid answer a′ is "Jon Landau", who happens to be another producer of Avatar. We observe that such coincidences rarely happen: among 100 randomly sampled claims, we observed only 2 such cases. We therefore leave them as natural noise of the generation model.

NEI claim generation. We need to generate a question q that is relevant to but cannot be answered by P. To this end, we link P back to its original Wikipedia article W and expand the evidence with additional contexts P_ext: five randomly retrieved sentences from W that are not present in P. In our example in Figure 1, one additional context retrieved is "By the time the riots ended, 63 people had been killed". We then concatenate P and P_ext as the expanded evidence, based on which we generate a supported claim given an entity in P_ext as the answer (e.g., "63"). This results in a claim that is relevant to but unverifiable from the original evidence P.
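The evidence-expansion step for NEI claims can be sketched as follows. Sentence splitting and Wikipedia retrieval are simplified to plain lists of strings, and the example sentences are taken from the paper's running example; this is a sketch of the sampling logic only.

```python
import random

def expand_evidence(evidence_sents, article_sents, k=5, seed=0):
    """Sample up to k sentences from the source article W that are
    absent from the evidence P; questions asked over these extra
    contexts cannot be answered by P alone."""
    pool = [s for s in article_sents if s not in set(evidence_sents)]
    rng = random.Random(seed)  # seeded for reproducibility of the demo
    return rng.sample(pool, min(k, len(pool)))

P = ["The 1992 riots began in Los Angeles."]
W = P + ["By the time the riots ended, 63 people had been killed.",
         "The riots lasted six days."]
P_ext = expand_evidence(P, W, k=5)  # only the two unseen sentences qualify
```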

Experiments
By applying our QACG model to each of the 18,541 Wikipedia articles in the FEVER training set, we generate a total of 176,370 supported claims, 360,924 refuted claims, and 258,452 NEI claims. The generated data is around five times the size of the human-annotated claims in FEVER. We name this generated dataset QACG-Full. We then create a balanced dataset, QACG-Filtered, by randomly sampling 100,000 examples for each class. Statistics of FEVER and the generated dataset are given in Appendix B.
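The balancing step from QACG-Full to QACG-Filtered can be sketched as per-class downsampling. The function and the toy dataset below are illustrative; the paper samples 100,000 examples per class.

```python
import random

def balance(dataset, per_class, seed=13):
    """Downsample a labeled dataset to `per_class` examples per label.

    dataset: iterable of (evidence, claim, label) triples.
    """
    by_label = {}
    for ex in dataset:
        by_label.setdefault(ex[2], []).append(ex)
    rng = random.Random(seed)
    out = []
    for label, exs in by_label.items():
        out.extend(rng.sample(exs, min(per_class, len(exs))))
    rng.shuffle(out)  # avoid label-sorted training order
    return out

toy = ([("e", "c", "SUPPORTED")] * 6
       + [("e", "c", "REFUTED")] * 9
       + [("e", "c", "NEI")] * 7)
filtered = balance(toy, per_class=5)  # 5 examples of each class
```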
Evaluation Datasets. We evaluate fact verification on three test sets based on FEVER: 1) FEVER-S/R: since only the supported and refuted claims are labeled with gold evidence in FEVER, we take the claim-evidence pairs of these two classes from the FEVER test set for evaluation. 2) FEVER-Symmetric: a carefully-designed unbiased test set by Schuster et al. (2019) for probing the robustness of fact verification models; only supported and refuted claims are present in this test set. 3) FEVER-S/R/N: the full FEVER test set, used for three-class verification. We follow Atanasova et al. (2020) in using the system of Malon (2019) to retrieve evidence sentences for NEI claims.

Fact Verification Models. As shown in Table 1, we take a BERT model (S1) and a RoBERTa model (S2) fine-tuned on the FEVER training set as the supervised models. Their corresponding zero-shot settings are Rows U5 and U6, where the models are trained on our generated QACG-Filtered dataset. Note that for binary classification (FEVER-S/R and FEVER-Symmetric), only the supported and refuted claims are used for training, while for FEVER-S/R/N, the full training set is used.

We compare with four baselines that likewise require no human-annotated claims. Random Guess (U1) is a weak baseline that randomly predicts the class label. GPT2 Perplexity (U2) predicts the class label based on the perplexity of the claim under a pretrained GPT2 (Radford et al., 2019) language model, following the assumption that "misinformation has high perplexity" (Lee et al., 2020a). MNLI-Transfer (U3) trains a BERT model for natural language inference on the MultiNLI corpus (Williams et al., 2018) and applies it to fact verification. LM as Fact Checker (Lee et al., 2020b) (U4) leverages the implicit knowledge stored in a pretrained BERT language model to verify a claim. Implementation details are given in Appendix C. Table 1 summarizes the fact verification performance, measured by macro Precision (P), Recall (R), and F1 Score (F1).

Main Results
Table 1 reports the main results. Training with our generated claims avoids some of the annotation artifacts present in FEVER, leading to a more robust fact verification model.

Few-shot Fact Verification
We then explore QACG's effectiveness in the few-shot learning setting, where only a few human-labeled (evidence, claim) pairs are available. We first train the RoBERTa-large fact verification model with our generated dataset QACG-Filtered. Then we finetune the model with a limited amount of human-labeled claims in FEVER. The blue solid line in Figure 2 shows the F1 scores on FEVER-Symmetric after finetuning with different numbers of labeled training examples. We compare this with training the model from scratch on the human-labeled data (grey dashed line). Our model performs consistently better than the model without pretraining, regardless of the amount of labeled training data. The improvement is especially prominent in data-poor regimes: our approach achieves 78.6 F1 with only 50 labeled claims per class, compared with 52.9 F1 without pretraining (+25.7). This leaves only a 7.9 F1 gap to the fully-supervised setting (86.5 F1) trained with over 100K examples. These results show that pretraining fact verification with QACG greatly reduces the demand for in-domain human-annotated data. Our method can provide a "warm start" for a fact verification system applied to a new domain where training data are limited.

Table 2: Examples of claims generated by QACG.

Evidence: Budapest is cited as one of the most beautiful cities in Europe, ranked as the most liveable Central and Eastern European city on EIU's quality of life index, ranked as "the world's second best city" by Conde Nast Traveler, and "Europe's 7th most idyllic place to live" by Forbes.

SUPPORTED claims:
Budapest is ranked as the most liveable city in central Europe.
Budapest ranks 7th in terms of idyllic places to live in Europe.

REFUTED claims:
Budapest ranks 11th in terms of idyllic places to live in Europe.
Budapest is ranked the most liveable city in Asia.

NEI claims:
Budapest is one of the largest cities in the European Union.
Budapest is the capital of Hungary.

Evidence: Alia Bhatt received critical acclaim for portraying emotionally intense characters in the road drama Highway (2014)

Analysis of Generated Claims

Table 2 shows representative claims generated by our model. The claims are fluent, label-cohesive, and exhibit encouraging language variety. However, one limitation is that the generated claims mostly lack deep reasoning over the evidence. This is because we finetune the question generator on the SQuAD dataset, in which more than 80% of the questions are shallow factoid questions.

To better understand whether this limitation creates a domain gap between the generated claims and human-written claims, we randomly sampled 100 supported claims and 100 refuted claims and analyzed whether reasoning is required to verify them. We find that 38% of the supported claims and 16% of the refuted claims in FEVER require either commonsense reasoning or world knowledge to verify. Table 3 shows three typical examples. We therefore believe this domain gap is the main bottleneck of our system. Future work is needed to generate more complex claims involving multi-hop, numerical, and commonsense reasoning, so that our model can be applied to more complex fact-checking scenarios.

Conclusion and Future Work
We utilize a question generation model to ask different questions about given evidence and convert the question-answer pairs into claims with different labels. We show that the generated claims can train a well-performing fact verification model in both the zero-shot and few-shot learning settings. Potential future directions include: 1) generating more complex claims that require deep reasoning; 2) extending our framework to other fact-checking domains beyond Wikipedia, e.g., news and social media; 3) leveraging generated claims to improve the robustness of fact-checking systems.

Ethical Considerations
We discuss two potential issues of claim generation, showing how our work sidesteps them. While individuals may express harmful or biased claims, our work only focuses on generating factoid claims from a corpus. In this work, we take Wikipedia as the source of objective fact. Practicing this technique thus requires identifying an appropriate source of objective truth from which to generate claims. Another potential misuse of claim generation is to generate refuted claims and subsequently spread such misinformation. We caution practitioners to treat the generated claims with care. In our case, we use the generated claims only to optimize for the downstream fact verification task. We advise against releasing generated claims for public use, especially on public websites, where they may be crawled and subsequently used for inference. As such, we will release the model code but not the output of our work. Practitioners can re-run the training pipeline to replicate our experiments accordingly.

A Evaluation of Question Generation
To implement the question generator, we finetune the pretrained BART model provided by the HuggingFace library on the SQuAD dataset. The code is based on the SimpleTransformers library. The success of our QACG framework relies heavily on whether we can generate fluent and answerable questions given the evidence. Therefore, we separately evaluate the question generator using both automatic and human evaluation and investigate its impact on zero-shot fact verification.

A.1 Automatic Evaluation
We employ BLEU-4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004) to evaluate the performance of our implementation. We compare the BART model with several state-of-the-art QG models, using their reported performance on the Zhou split of SQuAD. Table 4 shows the evaluation results against all baseline methods. The BART model achieves a BLEU-4 of 21.32, outperforming NQG++, S2ga-mp-gsa, and CGC-QG by large margins. This is expected, since these three baselines are Seq2Seq-based and do not use language model pretraining. Compared with the current state-of-the-art model UniLM, the BART model achieves comparable results, with slightly lower BLEU-4 but higher METEOR.
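To make the reported metric concrete, here is a minimal sentence-level BLEU-4 sketch (single reference, uniform n-gram weights, no smoothing); actual evaluations use standard corpus-level BLEU tooling, not this simplified version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate: str, reference: str) -> float:
    """Unsmoothed sentence-level BLEU-4 with brevity penalty."""
    c, r = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        cand, ref = ngrams(c, n), ngrams(r, n)
        match = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # clipped
        total = max(sum(cand.values()), 1)
        if match == 0:
            return 0.0  # unsmoothed: any zero n-gram precision kills BLEU
        log_prec += 0.25 * math.log(match / total)
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(log_prec)
```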

A.2 Impact of Answerability
Given the evidence P and the answer A, the generated question Q must be answerable by P and take A as its correct answer. This is the premise of generating a correct SUPPORTED claim. Therefore, we specifically evaluate this answerability property via human ratings. We randomly sample 100 generated question-answer pairs with their corresponding evidence and ask two workers to judge the answerability of each sample. We do this for both the NQG++ model and the BART model. To investigate the impact of question quality on fact verification performance, we separately use NQG++ and BART as the question generator to generate claims and train the RoBERTa model. The results are summarized in Table 5. We find that 89.5% of the questions generated by the BART model are answerable, significantly outperforming the 63.5% achieved by NQG++. When switching the question generator to NQG++, the fact verification F1 drops to 62.3 (−22.1 compared with BART). This shows that answerability plays an important role in ensuring the validity of the generated claims and has a large impact on fact verification performance.

B Data Statistics

Table 6 shows the basic data statistics of FEVER, FEVER-Symmetric, and the dataset generated by QACG. We use the balanced dataset QACG-Filtered, sampled from QACG-Full, to train the fact verification model in the zero/few-shot settings. Compared with the original FEVER dataset, our generated QACG-Filtered dataset has a balanced number of claims for each class. Moreover, because QACG can generate three different types of claims from the same evidence (shown in Figure 1), it results in a more "unbiased" dataset in which the model must rely on the (evidence, claim) pair rather than the evidence alone to infer the class label.

C Model Implementation Details
BERT-base and RoBERTa-large (S1, S2, U5, U6). We use the bert-base-uncased (110M parameters) and roberta-large (355M parameters) models provided by the HuggingFace library to implement the BERT and RoBERTa models, respectively. Each model is fine-tuned with a batch size of 16 and a learning rate of 1e-5 for a total of 5 epochs, and the epoch with the best performance is saved.
GPT2 Perplexity (U2). To measure perplexity, we use the HuggingFace implementation of the medium GPT-2 model (gpt2-medium, 345M parameters). We rank the claims in the FEVER test set by their perplexity under the GPT-2 model and predict the label for each claim under the assumption that misinformation has high perplexity. However, manually setting a perplexity threshold is difficult. Since the FEVER test set contains an equal number of claims for each class, we predict the claims in the top 1/3 of the ranking list as refuted and those in the bottom 1/3 as supported; the remaining claims are labeled NEI. The number of predicted labels for each class is therefore equal.
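The tercile rule for this baseline can be sketched as below. The perplexity values in the demo are made up; in the baseline they come from scoring each claim with GPT-2.

```python
def label_by_perplexity(claims_with_ppl):
    """Split claims into thirds by perplexity: lowest third SUPPORTED,
    highest third REFUTED, middle third NEI."""
    ranked = sorted(claims_with_ppl, key=lambda x: x[1])  # low perplexity first
    k = len(ranked) // 3
    labels = {}
    for i, (claim, _) in enumerate(ranked):
        if i < k:
            labels[claim] = "SUPPORTED"
        elif i >= len(ranked) - k:
            labels[claim] = "REFUTED"
        else:
            labels[claim] = "NEI"
    return labels

# Hypothetical perplexity scores for three claims:
labels = label_by_perplexity([("c1", 10.0), ("c2", 55.0), ("c3", 200.0)])
```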
MNLI-Transfer (U3). We use the HuggingFace BERT-base model (110M parameters) fine-tuned on the Multi-Genre Natural Language Inference (MNLI) corpus, a crowd-sourced collection of 433K sentence pairs annotated with textual entailment information. We then directly apply this model for fact verification on the FEVER test set. The class labels entailment, contradiction, and neutral in the NLI task are mapped to supported, refuted, and NEI, respectively, for the fact verification task.
LM as Fact Checker (U4). Since there is no publicly available code for this model, we implement our own version following the settings described in Lee et al. (2020b). We use HuggingFace's bert-base as the language model to predict the masked named entity, and use the NLI model described in U3 as the entailment model.