Prompt to be Consistent is Better than Self-Consistent? Few-Shot and Zero-Shot Fact Verification with Pre-trained Language Models

Few-shot or zero-shot fact verification relies on only a few or no labeled training examples. In this paper, we propose a novel method called ProToCo, to \underline{Pro}mpt pre-trained language models (PLMs) \underline{To} be \underline{Co}nsistent, for improving the factuality assessment capability of PLMs in the few-shot and zero-shot settings. Given a claim-evidence pair, ProToCo generates multiple variants of the claim with different relations and frames a simple consistency mechanism as constraints for making compatible predictions across these variants. We update PLMs using parameter-efficient fine-tuning (PEFT), leading to more accurate predictions in few-shot and zero-shot fact verification tasks. Our experiments on three public verification datasets show that ProToCo significantly outperforms state-of-the-art few-shot fact verification baselines. With a small number of unlabeled instances, ProToCo also outperforms the strong zero-shot learner T0 on zero-shot verification. Compared to large PLMs using the in-context learning (ICL) method, ProToCo outperforms OPT-30B and the Self-Consistency-enabled OPT-6.7B model in both few- and zero-shot settings.


Introduction
The problem of misinformation has sparked significant attention on the task of fact verification within the natural language processing (NLP) community. This task, typically represented by the Fact Extraction and VERification (FEVER) benchmark (Thorne et al., 2018), requires models to verify whether pieces of evidence support, refute, or contain not enough information (NEI) to validate a given claim.
Fully supervised fact verification has been widely studied and has achieved good performance on data from different domains (Nie et al., 2019; Ma et al., 2019; Wadden et al., 2020; Guo et al., 2022). However, collecting a large set of training data is labor-intensive, time-consuming and costly, especially with the constant emergence of new events, such as COVID-19 (Lee et al., 2021; Pan et al., 2021; Saakyan et al., 2021), that may be out-of-domain. Few-shot fact verification is an urgent need but has received little attention, because its performance was previously not competitive given very few training data (Lee et al., 2021; Zeng and Zubiaga, 2022), not to mention the zero-shot setting with no labeled data available at all.

[Figure 1 content. Evidence: "Coronavirus disease 2019 is a zoonotic infectious disease caused by severe acute respiratory syndrome coronavirus 2." Original claim: "The Coronavirus disease 2019 has a zoonotic origin." Variants: "It is true that the Coronavirus disease 2019 has a zoonotic origin.", "It is unclear that the Coronavirus disease 2019 has a zoonotic origin.", "It is false that the Coronavirus disease 2019 has a zoonotic origin.", labeled Supports/Refutes.]
In this paper, we try to improve PLMs' capability of factuality assessment for few-shot and zero-shot evidence-based fact verification. In general, consistency in fact verification dictates how we assess the veracity of a claim based on the given evidence. For example, Figure 1 shows that, given the same evidence and three major variants of the claim, the judgement of factuality on the confirmation variant "It is true that [claim]" should remain the same as that on the original claim, while the judgements on the uncertainty variant "It is unclear that [claim]" and the negation variant "It is false that [claim]" should be opposite to that on the original claim. The relations (i.e., confirmation, uncertainty and negation) between the claim and its variants naturally constrain what decisions should be made for the variants once the decision on the claim is determined, and vice versa. Such simple consistency constraints, with minor adjustments, can be generalized to different cases (e.g., when the evidence refutes the claim; see Section 4.3 for details). Meanwhile, prior studies on consistency in other domains (e.g., knowledge bases and question answering (QA)) have shown a strong correlation between PLMs' performance and their self-consistency (Elazar et al., 2021; Wang et al., 2022), but it has been empirically observed that PLMs are insufficient at transferring self-consistency to downstream tasks (Ettinger, 2020; Kassner and Schütze, 2020; Kassner et al., 2021; Elazar et al., 2021). We therefore aim to explicitly impose consistency on PLMs to improve few-shot and zero-shot fact verification performance.
Inspired by the recent success of prompt-enabled PLMs on various few-shot NLP tasks via natural language prompts formed with templates (Radford et al., 2019; Brown et al., 2020; Gao et al., 2021; Liu et al., 2022a), we construct the variants of a given claim by simply altering the prompt template while keeping the claim itself unchanged. Further, we define a factuality-grounded consistency mechanism based on the aforementioned relations between the claim and its variants, and assign the labels (i.e., support, refute, and NEI) satisfying the consistency constraints to the variants, so that we obtain a set of claim-evidence pairs with consistency constraints. To bring such consistency to PLMs, we then use these pairs to fine-tune T-Few (Liu et al., 2022a), a prompt-enabled PLM with a parameter-efficient fine-tuning (PEFT) method that only updates a small number of parameters. We name our method ProToCo, for Prompting PLMs To be Consistent, which improves the consistency of PLMs for few-shot and zero-shot fact verification. Our main contributions can be summarized as follows1:

• We design a general factuality-grounded consistency scheme that provides explicit consistency constraints for improving few-shot fact assessment and is generalizable to the zero-shot setting.
1 Code and dataset are available at https://github.com/znhy1024/ProToCo

• We propose ProToCo, a novel prompt-based consistency training method for improving PLMs on few-shot and zero-shot fact verification.
• Evaluation results on three public fact verification datasets from different domains confirm that ProToCo outperforms the state-of-the-art few-shot baselines by up to a 30.4% relative improvement in terms of F1, and also consistently outperforms the strong zero-shot learner T0-3B (Sanh et al., 2022) in zero-shot verification.
• When compared to large PLMs in both settings, ProToCo achieves overall better performance than OPT-30B (Zhang et al., 2022) and significantly outperforms the Self-Consistency-enabled OPT-6.7B model based on Chain-of-Thought (CoT) prompting (Wang et al., 2022).

Related Work
Existing methods have tried to address few-shot fact verification by utilizing the implicit knowledge of PLMs encoded in their parameters, without gradient updates. Lee et al. (2021) hypothesized that the perplexity of the concatenated claim-evidence text sequence evaluated by a language model could benefit claim verification, and used a few training instances to find the perplexity-score threshold for determining the label of a test claim. Zeng and Zubiaga (2022) utilized PLMs to create a set of representative vectors for each class based on the semantic difference between the claim and evidence of a few training instances, which are then used to label test claims by Euclidean distance during inference. However, these models do not update model parameters and rely solely on the pre-encoded knowledge of PLMs, which cannot improve the language model itself and may not generalize well to new domains. They also cannot perform the zero-shot task, as a few labeled instances are required as anchors for labeling new instances. Our method aims to update PLMs efficiently to utilize the new knowledge in a few examples and to enforce the model's consistency, improving both few-shot and zero-shot verification.
Recently, several studies have worked to generalize PLMs to the target domain by fine-tuning the model on the full fact verification training set from a different domain (Wadden et al., 2020; Saakyan et al., 2021; Schuster et al., 2021; Wadden et al., 2022). Meanwhile, some works instruct PLMs to generate task-specific training data used to fully train a classifier for fact verification (Pan et al., 2021; Wright et al., 2022). Such works need a carefully crafted generation policy based on a real corpus of the task, and their performance heavily depends on the quality of the generated data. These approaches are considered distantly supervised and differ significantly from ours, as they do not aim to build any few-shot or zero-shot model. Unlike these studies, we assume that the language model is minimally aware of the fact verification task, with only a few task-specific examples that may even be unlabeled.
In general, PLMs have shown strong few-shot learning ability on various NLP tasks (Brown et al., 2020; Sanh et al., 2022). In-context learning (ICL) uses natural language prompts or instructions to elicit the desired output from PLMs without gradient updates (Radford et al., 2019; Brown et al., 2020). However, ICL struggles to handle many prompted instances (Liu et al., 2022a), is sensitive to the prompt design (Liu et al., 2022b; Lu et al., 2022) and performs worse than fine-tuning (Brown et al., 2020; Liu et al., 2022a). An alternative approach is parameter-efficient fine-tuning (PEFT), which updates only a small number of parameters to bridge the gap with standard fine-tuning (Houlsby et al., 2019; He et al., 2022; Mahabadi et al., 2021; Lester et al., 2021; Wei et al., 2022; Ben Zaken et al., 2022; Liu et al., 2022a). Our method utilizes T-Few (Liu et al., 2022a), a state-of-the-art PEFT-enabled model, as our backbone to perform the factuality-grounded consistency training.
Previous works evaluate the self-consistency of PLMs by modifying the context of input sentences (Ettinger, 2020; Kassner and Schütze, 2020; Ravichander et al., 2020; Elazar et al., 2021) and empirically show that PLMs are insufficient at transferring self-consistency to downstream tasks. Some works in question answering (QA) prompt large PLMs (e.g., GPT-3 (Brown et al., 2020)) to improve QA accuracy by strengthening the consistency of predicted answers. Wang et al. (2022) prompt a PLM to generate multiple explanations and candidate answers and choose the answer that occurs most consistently as the prediction. Maieutic prompting (Jung et al., 2022), designed for True-or-False commonsense QA, and ConCoRD (Mitchell et al., 2022), designed specifically for self-consistency benchmarks, both elicit PLMs to generate distributions over possible candidate answers, followed by a MaxSAT solver (Battiti, 2009) that infers the most probable answer by eliminating contradictory candidates. Both methods are based on consistency definitions different from ours and may not be suitable for the fact verification task.
Problem Definition

Let C be a fact verification dataset containing a training set C_train and a test set C_test, where each instance consists of an input x_i = (c_i, e_i) and a ground-truth label y_i ∈ Y = {Support, NEI, Refute}; the task is to predict whether the given pieces of evidence e_i support, refute, or have not enough information to validate the claim c_i. In the few-shot setting, we randomly sample K instances per class from C_train for training, as the class distribution is unknown; the few-shot training set C^fs_train thus contains 3K instances in total. The zero-shot setting is similar but only uses x_i for each instance, and the unlabeled training set is given as C^zs_train = {x_i}^3K. Note that the absence of ground-truth labels makes this setting zero-shot (Wright et al., 2022; Zhou et al., 2022). Similar to previous works (Lee et al., 2021; Liu et al., 2022a), we do not assume the availability of a development set, as this is more realistic in a limited-data scenario. Our goal is to generalize a PLM M_θ, fine-tuned only on C^fs_train or C^zs_train, to the unseen test set C_test, where θ denotes the language model parameters.

Prompt Construction
Given a labeled instance (x_i, y_i), the input x_i (i.e., c_i and e_i) and the label y_i are first reformatted as a natural language input and response using a prompt template T, which consists of an input template T_x and a target template T_y. For example, as shown in Figure 2, the reformatted input T_x(x_i) is obtained by filling the evidence and claim into their corresponding fields: Suppose {e_i}. Can we infer {c_i}?
and the reformatted label can be T_y(y_i) ∈ Choices. Here Choices is a prompt-specific target-word mapping containing the response keys {Yes, Maybe, No}, where Yes is mapped to Support, Maybe to NEI, and No to Refute.
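The template filling above can be illustrated with a small sketch. The helper names are ours, and the fixed template is only the running example from this section (ProToCo actually samples templates from P3 at each step):

```python
# Sketch of prompt construction: reformat a claim-evidence pair and its
# label as a natural-language input and response, per the example template.
CHOICES = {"Support": "Yes", "NEI": "Maybe", "Refute": "No"}

def input_template(evidence: str, claim: str) -> str:
    """T_x: fill the evidence and claim into the input template."""
    return f"Suppose {evidence}. Can we infer {claim}?"

def target_template(label: str) -> str:
    """T_y: map the gold label to its response key in Choices."""
    return CHOICES[label]

prompt = input_template(
    "Coronavirus disease 2019 is a zoonotic infectious disease "
    "caused by severe acute respiratory syndrome coronavirus 2",
    "the Coronavirus disease 2019 has a zoonotic origin",
)
target = target_template("Support")  # -> "Yes"
```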

Inference
We take a text-to-text PLM (e.g., T5 (Raffel et al., 2020)) as M_θ, since the prompted input and output are text sequences. Let V be the vocabulary of M_θ. We denote each T_x(x_i) as an input sequence of tokens x_i and T_y(y_i) as a target sequence of tokens y_i = {t_j ∈ V, j ∈ [1, |y_i|]} to be generated. Then, the probability of the target sequence is

p_θ(y_i | x_i) = ∏_{j=1}^{|y_i|} p_θ(t_j | x_i, t_{<j}),

where p_θ(t_j | x_i, t_{<j}) is the probability of each token t_j assigned by the model M_θ during autoregressive generation, given the input sequence x_i and the tokens generated prior to t_j. Since the sequence y_i corresponds to the class y_i, the predicted score for class y_i given by M_θ can be defined as the log-probability normalized by the length of the output sequence, to avoid a possible bias towards length (Liu et al., 2022a):

β(x_i, y_i, T) = (1 / |y_i|) ∑_{j=1}^{|y_i|} log p_θ(t_j | x_i, t_{<j}).   (1)

In this way, we obtain the predicted scores of all classes using Equation 1 and use rank classification for inference, following (Liu et al., 2022a): all classes are ranked by their predicted scores and the top-ranked class is taken as the prediction.
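Scoring and rank classification can be sketched as follows, assuming the per-token log-probabilities for each candidate target sequence are already available from the model; the numbers are made up for illustration:

```python
def class_score(token_logprobs):
    """Length-normalized log-probability of one target sequence (Eq. 1):
    sum of per-token log-probs divided by the sequence length."""
    return sum(token_logprobs) / len(token_logprobs)

def rank_classify(scores_by_class):
    """Rank classification: return the class with the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical per-token log-probs for the three verbalized targets.
logprobs = {
    "Support": [-0.1, -0.2],        # tokens of "Yes"
    "NEI":     [-1.5, -1.0, -0.8],  # tokens of "Maybe"
    "Refute":  [-2.0, -1.9],        # tokens of "No"
}
scores = {c: class_score(lp) for c, lp in logprobs.items()}
print(rank_classify(scores))  # -> Support
```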

The Consistency Mechanism
In this section, we describe how we establish consistency for the fact verification task. Our goal is two-fold: 1) construct a set of variants for a claim corresponding to three basic logical relations between the claim and a variant, i.e., confirmation, uncertainty, and negation; 2) ensure the labels of the variants can be unambiguously derived from these relations once the label of the original claim-evidence pair is known. To this end, we construct the logical variants by modifying the prompt input template T_x, as shown in Figure 2.
Specifically, we prepend "it is {w} that" before c_i to obtain a claim's logical variants, where w ∈ V can be an affirmative word (e.g., true), an uncertain word (e.g., unclear), or a negative word (e.g., false), corresponding to the aforementioned relations. Figure 2 shows the consistency constraints that the model should strive to satisfy, based on the set of labels assigned to the original claim and its variants given T_y(y_i). For example, when T_y(y_i) is Yes or No, the label of the confirmation variant should be the same as that of the original claim since they entail each other, while the negation variant should have the opposite label because of their contradiction, and the uncertainty variant is assigned No since the evidence is sufficient to draw a certain conclusion. The situation is slightly different when T_y(y_i) is Maybe, since there is not enough evidence to support or refute the confirmation and negation variants; as a result, both the confirmation and negation variants are designated Maybe, while the uncertainty variant is assigned Yes. With these consistency constraints, we can label the claim variants of each training instance and utilize them for fine-tuning the model.
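The mechanism can be summarized in a few lines; this is a sketch, and `make_variant` and `variant_labels` are our own helper names:

```python
# Consistency constraints of Figure 2: given the label of the original
# claim, derive the labels that its three logical variants must take.
VARIANT_WORDS = {"CON": "true", "UNC": "unclear", "NEG": "false"}
OPPOSITE = {"Yes": "No", "No": "Yes"}

def make_variant(claim: str, kind: str) -> str:
    """Prepend 'it is {w} that' to obtain a logical variant of the claim."""
    return f"it is {VARIANT_WORDS[kind]} that {claim}"

def variant_labels(orig_label: str) -> dict:
    """Labels the variants must take to stay consistent with the original."""
    if orig_label == "Maybe":
        # NEI: neither confirmation nor negation can be settled,
        # but the uncertainty statement itself is supported.
        return {"CON": "Maybe", "NEG": "Maybe", "UNC": "Yes"}
    # Yes/No: confirmation copies the label, negation flips it, and the
    # uncertainty variant is refuted since the evidence is decisive.
    return {"CON": orig_label, "NEG": OPPOSITE[orig_label], "UNC": "No"}
```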

Training Strategy
Few-shot fine-tuning of PLMs is challenging, as updating a large number of parameters with a few instances may result in unstable performance. Moreover, there are no labeled instances available for zero-shot fine-tuning. We now introduce how we bring our consistency mechanism into the training of PLMs in both settings.
We exploit the T-Few recipe, which applies a PEFT method called (IA)^3 (Liu et al., 2022a) to the zero-shot learner T0 (Sanh et al., 2022) to enable its few-shot ability. (IA)^3 modifies the Transformer (Vaswani et al., 2017) by multiplying the keys and values in attention and the intermediate activations of the position-wise feed-forward networks by learned vectors, so that only a small number of parameters are introduced for fine-tuning. T0 has been endowed with strong zero-shot generalizability by training an LM-adapted T5 (Lester et al., 2021) on a set of datasets covering numerous NLP tasks, where each training instance is reformatted as a natural language input and response using a prompt template.
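The (IA)^3 modification can be sketched in NumPy. This is a toy illustration of the rescaling idea, not the T-Few implementation: single-head attention, a ReLU feed-forward block, and the tensor shapes are simplifying assumptions.

```python
import numpy as np

def ia3_attention(q, k, v, l_k, l_v):
    """(IA)^3 sketch: learned vectors l_k, l_v elementwise rescale the
    keys and values; all other attention weights stay frozen."""
    k = k * l_k                      # broadcast over the sequence dimension
    v = v * l_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)   # row-wise softmax
    return w @ v

def ia3_ffn(h, w1, w2, l_ff):
    """Rescale the intermediate activations of the feed-forward block."""
    a = np.maximum(h @ w1, 0.0) * l_ff      # ReLU, then (IA)^3 scaling
    return a @ w2

# Initializing the learned vectors to ones leaves the frozen model unchanged.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
out_frozen = ia3_attention(q, k, v, np.ones(8), np.ones(8))
```

Only the vectors l_k, l_v and l_ff are trained, which is why so few parameters are updated per task.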

Loss Functions
With a few training instances, we follow Liu et al. (2022a) by combining several different loss functions to update the new parameters.
• The standard cross-entropy loss encourages M_θ to assign a higher probability p_θ(y_i | x_i) to the correct target sequence y_i given the input sequence x_i:

L_LM = -(1 / |y_i|) ∑_{j=1}^{|y_i|} log p_θ(t_j | x_i, t_{<j}).

• The classification task loss is also based on cross-entropy. Given the predicted scores β(x_i, y_i, T) assigned by the PLM, the probability of predicting class y_i can be calculated as

p(y_i | x_i) = exp(β(x_i, y_i, T)) / ∑_{y' ∈ Y} exp(β(x_i, y', T)),

and the loss for the task is

L_CLS = -log p(y_i | x_i).

• The unlikelihood loss forces incorrect target sequences y' ≠ y_i to be assigned lower probabilities (Welleck et al., 2020):

L_UL = -(∑_{y' ≠ y_i} ∑_{j=1}^{|y'|} log(1 - p_θ(t'_j | x_i, t'_{<j}))) / ∑_{y' ≠ y_i} |y'|.

The total loss for fine-tuning our backbone model T-Few is the sum of the above three losses:

L = L_LM + L_CLS + L_UL.
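Under the definitions above, with β the length-normalized score from Equation 1, the three losses can be sketched as follows. This is our reconstruction of the T-Few recipe, not code from its repository:

```python
import math

def lm_loss(token_logprobs):
    """Standard cross-entropy on the tokens of the correct target."""
    return -sum(token_logprobs) / len(token_logprobs)

def classification_loss(beta_scores, gold):
    """Softmax over length-normalized scores beta, then cross-entropy."""
    log_z = math.log(sum(math.exp(b) for b in beta_scores.values()))
    return -(beta_scores[gold] - log_z)

def unlikelihood_loss(incorrect_token_probs):
    """Push down the probability of every token of incorrect targets."""
    total = sum(math.log(1.0 - p)
                for seq in incorrect_token_probs for p in seq)
    n_tokens = sum(len(seq) for seq in incorrect_token_probs)
    return -total / n_tokens

def total_loss(token_logprobs, beta_scores, gold, incorrect_token_probs):
    """Sum of the three losses used to update the (IA)^3 vectors."""
    return (lm_loss(token_logprobs)
            + classification_loss(beta_scores, gold)
            + unlikelihood_loss(incorrect_token_probs))
```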

Few-Shot and Zero-Shot Training
In the few-shot setting, we first fine-tune the model with the original labeled instances as a warm-up, and then continue fine-tuning with the created variants and their logically consistent labels, which are derived from the claim's label following the proposed consistency mechanism (see Section 4.3).
Given no labeled instances in the zero-shot setting, we directly fine-tune the model with the variants using the following strategy: at each training step, the PLM's prediction on the original instance is used to assign pseudo labels to its variants based on the proposed consistency mechanism. To some extent, this training strategy regularizes the PLM and guides it to update its prediction on the original instance. Note that this method is still zero-shot, since training considers only the fixed logical relations between the claim and its variants and exploits no ground-truth information.
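One zero-shot training step can be sketched as below; `predict` and `fine_tune_step` are hypothetical callables wrapping the PLM, and the label mapping restates the consistency mechanism of Section 4.3:

```python
OPPOSITE = {"Yes": "No", "No": "Yes"}

def consistency_targets(pseudo):
    """Variant labels implied by the prediction on the original claim."""
    if pseudo == "Maybe":
        return {"true": "Maybe", "false": "Maybe", "unclear": "Yes"}
    return {"true": pseudo, "false": OPPOSITE[pseudo], "unclear": "No"}

def zero_shot_step(predict, fine_tune_step, claim, evidence):
    """One step: pseudo-label the variants from the model's own prediction
    on the original instance, then fine-tune on the variants only."""
    pseudo = predict(claim, evidence)        # "Yes" / "Maybe" / "No"
    batch = [(f"it is {w} that {claim}", evidence, label)
             for w, label in consistency_targets(pseudo).items()]
    fine_tune_step(batch)                    # updates only the PEFT vectors
    return batch
```

No gold label ever enters the loop; only the fixed logical relations constrain the updates.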

Datasets
We use three public fact verification datasets from different domains. Their statistics are shown in Table 1.

Experimental Settings
For few-shot fact verification, we report 4-shot experiments as the main results. We also conduct K-shot experiments for K ∈ {1, 2, 4, 8, 16}, reported as supplementary results. For zero-shot experiments, we randomly sample 30 instances per class from each training set for fine-tuning; note that no labels are used in this setting. For fair and robust comparison, we sample the training instances using four random seeds and report the mean macro-F1 and standard deviation over these four splits in all experiments. The seeds and data splits are kept the same across different models.
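The sampling protocol can be sketched as follows; `sample_k_shot` is our own helper name, not from the released code:

```python
import random
from collections import defaultdict

def sample_k_shot(instances, k, seed):
    """Sample K instances per class with a fixed seed, so splits are
    reproducible and identical across models. For the zero-shot setting
    the sampled labels are simply discarded afterwards."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in instances:
        by_class[y].append((x, y))
    split = []
    for label in sorted(by_class):           # deterministic class order
        split.extend(rng.sample(by_class[label], k))
    return split                             # 3K instances for 3 classes
```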
We use the original source code2 of T-Few (Liu et al., 2022a) with its released pre-trained checkpoint of 3B parameters as our backbone model. Following the T-Few paper, we randomly sample a prompt template from the Public Pool of Prompts (P3) (Bach et al., 2022) for each instance at each training and inference step, to increase the diversity and variability of the prompts used. We set the number of training steps to 1,500, the batch size to 4, and the learning rate to 1 × 10^-4 for both the few-shot and zero-shot settings3.
For fine-tuning the RoBERTa-L model, we follow Lee et al. (2021), using a learning rate of 2 × 10^-5 and a batch size of 32, and train it for 10 epochs. We use the original code of GPT2-PPL4 and conduct experiments using GPT2-base as the backbone, following the original setting. Additionally, we also present the results of GPT2-PPL with the larger backbone GPT2-xl5. We reproduce SEED following the original implementation details in the paper (Zeng and Zubiaga, 2022), with BERT_nli6 as its base model, which was fine-tuned on NLI tasks. Furthermore, we report the results of SEED using the pre-trained model all-mpnet-base-v27 as backbone, since it provides the best-quality sentence embeddings among all pre-trained sentence-transformers models (Reimers and Gurevych, 2019)8. We use the code and pre-trained checkpoint with 3B parameters of T0 from Hugging Face Transformers9. All experiments use a server with 4 NVIDIA Tesla-V100 32GB GPUs.

Table 2: Results of different few-/zero-shot fact verification methods in the 4-shot and 0-shot settings on three datasets. We report the macro-F1 averaged over 4 trials with training samples randomly selected from the datasets using different seeds. The best results are in bold and the second-best results are underlined. The standard deviation is in (.).

Few-Shot Result
The results of few-shot fact verification are reported in Table 2. We make the following observations. Firstly, given very few labeled instances, RoBERTa-L does not always improve few-shot performance, consistent with the empirical finding that traditional fine-tuning of PLMs is unstable in the few-shot setting (Zhang et al., 2021; Mosbach et al., 2021; Dodge et al., 2020).
Secondly, with their designs for few-shot learning on PLMs, both versions of GPT2-PPL and SEED achieve much better performance than the majority class and RoBERTa-L, without any gradient update. Across backbone models, GPT2-PPL_xl outperforms GPT2-PPL_base due to its larger model size, while SEED_mpnet lags far behind SEED_nli, possibly because the base model BERT_nli, fine-tuned on the NLI task, can be more readily adapted to fact verification than the base model all-mpnet-base-v2, which was fine-tuned on the sentence matching task. On the SciFACT and FEVER datasets, SEED_nli, with its semantic difference vector, outperforms GPT2-PPL_xl, which predicts labels based on a perplexity score. However, SEED_nli is less advantageous on VitaminC, as the semantic vector becomes less able to identify the subtle factual differences in the contrastive instances.

6 https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens
7 https://huggingface.co/sentence-transformers/all-mpnet-base-v2
8 https://www.sbert.net/docs/pretrained_models.html#model-overview
9 https://huggingface.co/bigscience/T0
Thirdly, our backbone model T-Few clearly outperforms both versions of GPT2-PPL and SEED, which indicates that relying only on the implicit knowledge of PLMs without parameter updates is insufficient for few-shot fact verification. Also, compared to RoBERTa-L, the improvements obtained on all datasets show that the PEFT method (IA)^3 helps address the instability of traditional fine-tuning of PLMs in the few-shot setting.
Lastly, ProToCo with consistency training leads to consistent gains on all datasets, considerably improving over T-Few by 30.4%, 6.3% and 4.7% on SciFACT, VitaminC and FEVER, respectively, which demonstrates the effectiveness of imposing consistency constraints during model training.

Zero-Shot Result
We examine the effectiveness of ProToCo in the zero-shot setting, where it only uses a small number of unlabeled instances for training. The zero-shot results are also given in Table 2.
We can see that ProToCo performs better than T0-3B on all datasets, achieving improvements of 7.4%, 5.1% and 3.5% F1 on FEVER, SciFACT and VitaminC, respectively. Our consistency training also improves over T-Few by 10.6% and 8.5% on FEVER and SciFACT, respectively. However, ProToCo performs slightly worse than T-Few on VitaminC. Given the contrastive construction of the VitaminC dataset, we conjecture that this might be because consistency training alone cannot effectively enhance the base model's ability to distinguish the contrastive instances without any supervision signal or prior adversarial training of the base model. One possible solution is to use a stronger base model, pre-trained with adversarial data, to better capture the subtle differences in the contrastive instances. We leave this to future work.

Impact of Shots Number
Figure 3 compares the few-shot baselines and ProToCo as the number of shots K increases. ProToCo consistently outperforms the few-shot baselines at all K on the three datasets. The curves of both versions of SEED and GPT2-PPL saturate quickly compared to ProToCo, and switching to a larger backbone does not bring much improvement for GPT2-PPL as K increases, suggesting that fine-tuning PLMs is necessary for new knowledge to be learned and few-shot performance to improve.
Interestingly, the improvement of ProToCo over T-Few becomes clearly smaller as K increases on FEVER and VitaminC, which are based on Wikipedia data (Wikipedia-like data may have been seen during PLM pre-training). On the scientific-domain dataset SciFACT, however, consistency training still leads to a modest improvement even when K reaches 16 shots, and the gain appears inclined to keep growing. This indicates that consistency training is especially helpful when the PLM knows little about the type of data, as in the scientific domain.
As K increases, ProToCo continues to narrow the gap with the fully-supervised model fine-tuned on the full training set. On FEVER, using only 4 labeled instances per class, it already outperforms the fully-supervised model. On VitaminC, however, the trend suggests that its performance is unlikely to catch up with the fully-supervised model. Our analysis shows that the chance of sampling contrastive instances into such a limited number of training shots is low. As a consequence, the contrastive nature of this dataset may be underrepresented by the sampled instances, potentially limiting the model's ability to learn such features. We believe that using more training instances or a base model pre-trained on contrastive data might boost ProToCo's performance on VitaminC, but we leave this to future work.

Comparison to ICL of Large PLMs
We compare ProToCo to ICL with relatively large PLMs in both the few-shot and zero-shot settings. Specifically, we compare to OPT (Zhang et al., 2022) with 30B parameters10, 10 times larger than ProToCo, which is an open-source large causal language model with performance similar to GPT-3 (Brown et al., 2020). Results in Table 3 show that ProToCo achieves a much higher F1 score than few-shot ICL with OPT-30B. Compared to zero-shot ICL with OPT-30B, ProToCo clearly outperforms ICL on the FEVER and VitaminC datasets and performs equally well on SciFACT. This confirms the effectiveness of ProToCo in both settings and demonstrates how consistency training enables a smaller PLM to compete with the ICL method on a much larger PLM for the fact verification task.

Comparison to Self-Consistency Models
We compare ProToCo with the Self-Consistency Chain-of-Thought (SelfconCoT) method (Wang et al., 2022), which samples multiple outputs from a language model and returns the most consistent answer in the set. We implement SelfconCoT following the details described in (Wang et al., 2022), use OPT with 6.7B parameters as its base model, and sample 20 outputs for each instance11. Experiments are conducted with 3 training instances (1-shot) and evaluated on the full test set of SciFACT, and on a random subset of the test sets of FEVER and VitaminC given our limited resources.
Results in Table 4 show that ProToCo significantly outperforms SelfconCoT on all datasets, despite the latter having more than twice as many parameters, suggesting that a PLM with our consistency training is more suitable for the fact verification task. Additionally, on the same hardware, ProToCo is considerably more efficient than SelfconCoT, taking around 6 hours to finish 4 runs of experiments thanks to the PEFT method, while SelfconCoT needs around 14 hours.

Conclusion and Future Work
We propose a model called ProToCo to improve few- and zero-shot fact verification through consistency training of PLMs. Experiments on three public datasets show that ProToCo achieves promising fact verification performance, outperforming the existing few- and zero-shot baselines, in-context learning with large PLMs, and the self-consistency chain-of-thought method. Our method also outperforms the fully-supervised model on the FEVER dataset.
In the future, we will explore few- and zero-shot solutions for other stages of fact-checking, e.g., evidence retrieval and justification generation, and combine them with ProToCo. We also plan to evaluate the performance and level of consistency of larger language models (e.g., GPT-3 (Brown et al., 2020), InstructGPT (Ouyang et al., 2022) and LLaMA (Touvron et al., 2023)) on the fact verification task, when computing resources become available.

Limitations
While ProToCo works well with our consistency training for improving fact verification under the few-shot and zero-shot settings, our work has some limitations. Due to limited resources, we were currently unable to compare with larger PLMs and examine whether extremely large models have already developed a similar or better level of consistency for fact verification on their own. In addition, our experiments show that consistency training brings improvements in both settings using only gold evidence. However, the retrieved evidence in a real-world setting can be noisy and incomplete. That said, the performance of ProToCo on non-oracle evidence requires further study.

Figure 1: An illustration of our consistency mechanism for evidence-based fact verification when the evidence supports the claim. The PLM's judgements on the variants should be logically consistent across the different variants of the claim.

Figure 2: The architecture of our ProToCo model. Given a claim-evidence pair, the confirmation variant (CON), uncertainty variant (UNC) and negation variant (NEG) are created by modifying the prompt template. The original input (ORG) and its variants are used to train the PLM. We use a PEFT method (i.e., (IA)^3) to train the PLM, which only updates the parameters of additionally learned vectors while other parameters are frozen. Consistency is imposed as constraints on the PLM's predictions over the claim and its variants.

Figure 3: Performance comparison under different numbers of shots K. For all K tested, ProToCo consistently outperforms all the baselines. We report the results of fully-supervised models using oracle evidence as a reference: the results of the RoBERTa-large model from Pradeep et al. (2021) and Pan et al. (2021) on SciFACT and FEVER, respectively; the result of the ALBERT-xlarge model (Lan et al., 2020) on VitaminC is obtained by evaluating the test set using the checkpoint and original code provided by Schuster et al. (2021).

FEVER (Thorne et al., 2018) provides manually crafted claims created by altering factual sentences from Wikipedia. The claims are classified as Support, Refute or NEI by annotators. This dataset only provides gold evidence for the Support and Refute classes.

Table 1: Statistics of the three datasets used for evaluation.

Table 4: Results of ProToCo compared with Self-Consistency Chain-of-Thought (Wang et al., 2022) using 3 training instances. The evaluations on FEVER and VitaminC are based on a random subset of the test set given limited resources.
To utilize consistency constraints, ProToCo still needs to fine-tune the PLMs. Also, in the zero-shot setting, the labels of the logical variants are assigned from the base model's predictions on the original claim, which could be inaccurate and thus affect the consistency training.