QA-NatVer: Question Answering for Natural Logic-based Fact Verification

Fact verification systems assess a claim's veracity based on evidence. An important consideration in designing them is faithfulness, i.e. generating explanations that accurately reflect the reasoning of the model. Recent works have focused on natural logic, which operates directly on natural language by capturing the semantic relations between aligned spans of a claim and its evidence via set-theoretic operators. However, these approaches rely on substantial resources for training, which are only available for high-resource languages. To this end, we propose to use question answering to predict natural logic operators, taking advantage of the generalization capabilities of instruction-tuned language models. We thus obviate the need for annotated training data while still relying on a deterministic inference system. In a few-shot setting on FEVER, our approach outperforms the best baseline by 4.3 accuracy points, including a state-of-the-art pre-trained seq2seq natural logic system and a state-of-the-art prompt-based classifier. Our system demonstrates its robustness and portability, achieving competitive performance on a counterfactual dataset and surpassing all approaches without further annotation on a Danish verification dataset. A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.


Introduction
Automated fact verification is concerned with the task of identifying whether a factual statement is true, with the goal of improving digital literacy (Vlachos and Riedel, 2014). A typical fact verification system consists of an evidence retrieval and a claim verification component (i.e. the judgement of whether a claim is true). The latter is typically implemented as a neural entailment system (Guo et al., 2022, inter alia), which is not transparent with regard to its underlying reasoning. While efforts have been made to improve explainability, for instance by highlighting salient parts of the evidence (Popat et al., 2018) or generating summaries (Kotonya and Toni, 2020), there is no guarantee that the explanations are faithful, i.e. that they accurately reflect the reasoning of the model (Jacovi and Goldberg, 2020; Atanasova et al., 2023).
In contrast, proof systems like NaturalLI (Angeli and Manning, 2014) perform natural logic inference as proofs and are faithful by design. Their transparent reasoning empowers actors to make informed decisions about whether to trust the model and which parts of the prediction to dispute (Jacovi and Goldberg, 2021). Recently, Krishna et al. (2022) constructed a natural logic theorem prover for claim verification, using an autoregressive formulation with constrained decoding. However, an important limitation of this approach is its dependence on substantial resources for training: it relies on large datasets for claim verification and on structured knowledge bases such as the Paraphrase Database (Ganitkevitch et al., 2013), WordNet (Miller, 1994), and Wikidata (Vrandečić and Krötzsch, 2014). Such manually curated resources are typically available only for high-resource languages, limiting its applicability.
To this end, we propose QA-NatVer: Question Answering for Natural Logic-based Fact Verification, a natural logic inference system that composes a proof by casting natural logic into a question-answering framework. As illustrated in Figure 1, a proof is a sequence of steps, each describing the semantic relation between a claim span and an evidence span via a set-theoretic natural logic operator (NatOp), and this sequence of NatOps determines the veracity of the claim following a deterministic finite state automaton (DFA). QA-NatVer predicts NatOps using operator-specific boolean questions (cf. Table 1 for examples for all operators). For instance, the relation between the claim span was born and the evidence Born is ascribed the equivalence NatOp (≡), which we predict with questions such as Is "was born" a paraphrase of "Born"?. This formulation enables us to make use of instruction-finetuned language models, which are powerful learners even with limited supervision (Sanh et al., 2022, inter alia). Since the input format of our question-answering formulation constrains the context to the aligned claim-evidence spans, we generate claim-evidence alignments between overlapping spans of varying length and individually predict the NatOp for each pair of aligned spans. To select the best proof over all possible proofs, we combine the answer scores to the questions associated with each proof.

Figure 1: The current veracity state and mutation operator determine the transition to the next state via a finite state automaton (DFA). Starting at S, the span Anne Rice is mutated via the equivalence operator (≡), resulting in S, according to the DFA. The inference ends in R, indicating the claim's refutation. We use question answering to predict the NatOps, taking advantage of the generalization capabilities of instruction-tuned language models.
In a few-shot setting with 32 training instances on FEVER, QA-NatVer outperforms all baselines by 4.3 accuracy points, including ProoFVer, LOREN (Chen et al., 2022a), and a state-of-the-art few-shot classification system, T-Few (Liu et al., 2022). By scaling the instruction-tuned model from BART0 to Flan-T5, we achieve a score of 70.3 ± 2.1, closing the gap to fully supervised models trained on over 140,000 instances to 8.2 points. On an adversarial claim verification dataset, Symmetric FEVER (Schuster et al., 2019), QA-NatVer scores higher than few-shot and fully supervised baselines, including ProoFVer, demonstrating the robustness of our approach. In a low-resource scenario on a Danish fact verification dataset (Nørregaard and Derczynski, 2021, DanFEVER), without further annotation for training, our system outperforms all baselines by 1.8 accuracy points, highlighting the potential of the question-answering formulation for low-resource languages. An ablation study indicates the benefits of QA-NatVer's question-answering formulation, which outperforms ChatGPT (OpenAI, 2022) (over 430x the size of BART0) prompted with in-context examples to predict NatOps by 11.9 accuracy points. Finally, we show in a human evaluation that QA-NatVer improves over previous natural logic inference systems in terms of explainability, producing more plausible proofs with fewer erroneous NatOps.1

Related Work
Natural logic (Van Benthem, 1986; Sanchez, 1991) operates directly on natural language, making it an appealing alternative to explicit meaning representations such as lambda calculus, since the translation of claims and evidence into such representations is error-prone and difficult to decode for non-experts. NatLog (MacCartney and Manning, 2007, 2009) proposes the use of natural logic for textual inference, which was subsequently extended by Angeli and Manning (2014) into the NaturalLI proof system. With the surge of pre-trained language models, multiple works have attempted to integrate natural logic into neuro-symbolic reasoning systems (Feng et al., 2020, 2022). In particular, ProoFVer, a natural logic inference system specifically designed for fact verification, achieves competitive performance while remaining faithful and more explainable than entirely neural approaches (Krishna et al., 2022). Stacey et al. (2022) propose an alternative framework of logical reasoning, evaluating the veracity of individual claim spans (or of atomic facts in Stacey et al. (2023)) and determining the overall truthfulness by following a simple list of logical rules. Chen et al. (2022a) use a similar list of logical rules but aggregate the outcomes with a neural network component. However, all previous approaches have in common that they require substantial training data to perform well, limiting their use to resource-rich languages and domains.

Figure 2: QA-NatVer's proof construction. We first chunk the claim and the evidence, and align them at multiple granularity levels. We then assign a natural logic operator to each aligned claim-evidence span using question answering. Finally, we select the proof by combining the answer scores to the questions associated with the proof.
Casting a natural language problem to a question-answering setting has previously been explored in a variety of tasks, such as relation classification (Levy et al., 2017; Cohen et al., 2022) and semantic role labeling (He et al., 2015; Klein et al., 2022). For fact verification in particular, previous works have considered formulating it as a question generation task, decomposing a claim into relevant units of information to inquire about, followed by question answering to find relevant answers in a large knowledge base (Fan et al., 2020; Chen et al., 2022b). Yet, these works do not consider aggregating or constructing proofs from the answers to the questions.
Finally, work on few-shot claim verification is limited. Lee et al. (2021) explore using a perplexity score; however, their approach is constrained to binary entailment, i.e. either supported or refuted. Zeng and Zubiaga (2023) explore active learning in combination with PET (Schick and Schütze, 2021), a popular prompt-based few-shot learning method, and Pan et al. (2021) and Wright et al. (2022) generate weakly supervised training data for zero-shot claim verification. However, none of the aforementioned methods produces (faithful) explanations.

Method
Given a claim c and a set of k evidence sentences E, the task of claim verification is to predict a veracity label ŷ and to generate a justification for the selected label. QA-NatVer is a system that returns a natural logic inference proof, a sequence of steps P = m_1, . . ., m_l, each specifying a relation between a claim span and an evidence span together with a NatOp operator o.2 The sequence of operators O = o_1, . . ., o_l is then the input to a deterministic finite state automaton that specifies the veracity label ŷ = DFA(O) (cf. Figure 1). The proof P itself serves as the justification for the predicted label ŷ.
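The execution of a NatOp sequence through the DFA can be sketched as a simple table lookup. This is a minimal illustration: the transition table below is an assumption in the spirit of NaturalLI (Angeli and Manning, 2014) with three verdict states, not a reproduction of the exact DFA in the paper's Figure 1, and all names are ours.

```python
# States: 'S' (supported), 'R' (refuted), 'N' (not enough info).
# NOTE: illustrative transition table, not the paper's exact DFA.
TRANSITIONS = {
    ('S', 'equivalence'): 'S',
    ('S', 'forward_entailment'): 'S',
    ('S', 'reverse_entailment'): 'N',
    ('S', 'negation'): 'R',
    ('S', 'alternation'): 'R',
    ('S', 'independence'): 'N',
    ('R', 'equivalence'): 'R',
    ('R', 'forward_entailment'): 'N',
    ('R', 'reverse_entailment'): 'R',
    ('R', 'negation'): 'S',
    ('R', 'alternation'): 'N',
    ('R', 'independence'): 'N',
}

def verdict(natops, start='S'):
    """Run a NatOp sequence through the DFA; 'N' is absorbing."""
    state = start
    for op in natops:
        if state == 'N':
            break
        state = TRANSITIONS[(state, op)]
    return {'S': 'SUPPORTS', 'R': 'REFUTES', 'N': 'NOT ENOUGH INFO'}[state]
```

Under this table, the Figure 1 example (an equivalence step followed by a refuting mutation) ends in the refutation state.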
QA-NatVer constructs its proofs following a three-step pipeline, illustrated in Figure 2: multi-granular chunking of the claim and its evidence sentences and alignment between claim-evidence spans (Sec. 3.1), assignment of NatOps to each aligned pair using question answering (Sec. 3.2), and a proof selection mechanism over all possible proofs that combines the answer probabilities to the questions associated with each proof (Sec. 3.3).

Multi-granular Chunking & Alignment
We initially chunk the claim into l non-overlapping consecutive spans c = c_1, . . ., c_l, using the chunker of Akbik et al. (2019), and merge spans that do not contain any content words with their subsequent spans. To align each claim span c_i with the information of the highest semantic relevance in the evidence E, we use the fine-tuned contextualized word alignment system of Dou and Neubig (2021) to first align individual words between the claim and each evidence sentence E_j. These word-level alignments are then mapped back to the span c_i to form an evidence span e_ij from the sentence E_j. Since multiple spans in the evidence sentences could align with c_i, we measure the cosine similarity between c_i and each aligned evidence span e_ij, using latent span embeddings via Sentence Transformers (Reimers and Gurevych, 2019).
It is essential that the granularity of a claim span matches that of its evidence span to capture their semantic relationship correctly. Therefore, we additionally consider merging the claim chunks c_1, . . ., c_l into more coarse-grained chunks. Concretely, we concatenate m consecutive claim chunks into a single new span c_{i:i+m}, with m up to a length of 4. The merging process results in a total of at most q = 4 · l − 6 chunks. Additionally, we consider the claim c itself as the most coarse-grained unit and align it to evidence in E.
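The at-most 4·l − 6 count follows directly from enumerating every contiguous merge of up to four chunks (for l ≥ 4, there are l − m + 1 merges of length m, and Σ_{m=1..4}(l − m + 1) = 4l − 6). A minimal sketch; the function name and the list representation of chunks are ours:

```python
def merged_spans(chunks, max_len=4):
    """Enumerate all contiguous merges of up to max_len consecutive
    claim chunks, i.e. the multi-granular spans of Sec. 3.1."""
    spans = []
    for m in range(1, max_len + 1):                 # merge length
        for i in range(len(chunks) - m + 1):        # start position
            spans.append(" ".join(chunks[i:i + m]))
    return spans
```

For l = 4 initial chunks this yields exactly 4 · 4 − 6 = 10 spans, the longest one being the whole claim.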
Consider the example in Figure 2. A system that only considers a single chunking might treat was incapable and of writing as separate phrases. However, the evidence spans aligned to these individual phrases (She also wrote four and books, respectively) do not provide enough context individually to infer their semantic relations with respect to the claim spans. When merged into a single chunk, their semantic relation becomes obvious, as the span was incapable of writing is negated by she also wrote four books. Hence, a more flexible variable-length chunking enables finding more semantically coherent alignments.

NatOp Assignment via QA
Each claim-evidence pair has to be assigned a NatOp, specifying its semantic relationship. We assign one out of six NatOps o ∈ {≡, ⊑, ⊒, ¬, ⇃↾, #}3 to each claim-evidence span. We formulate the prediction of each NatOp o as a question-answering task (cf. Table 1), each of which is instantiated by one or more boolean question prompts T_o. The only exception is the independence operator (#), which is applied when none of the other operators is predicted. To predict whether a NatOp o holds between a claim span c_i and its aligned evidence e_i, we compute the log probabilities averaged over all question prompts T_o:

QA(a | c_i, e_i, T_o) = (1/|T_o|) Σ_{t ∈ T_o} log QA(a | c_i, e_i, t),   (1)

with a ∈ {Yes, No}, |T_o| being the number of question prompts, and QA being our seq2seq instruction-tuned language model (see App. A for details). We apply an argmax function to select the most probable answer â_o = argmax_a QA(a | c_i, e_i, T_o). An answer prediction â_o = Yes indicates that the NatOp o holds for the aligned claim-evidence spans. This formulation enables us to make effective use of the generalization abilities of instruction-tuned language models. As illustrated in Figure 2, given the aligned claim-evidence spans was incapable of writing and She also wrote four books, we ask questions for each of the five NatOps, with the spans embedded in them. In the figure, the negation NatOp (¬) is selected because its corresponding boolean question is answered positively. Since predictions are made independently for each NatOp, the model may predict multiple NatOps for a single pair of aligned claim-evidence spans. In these instances, we select the NatOp with the highest probability (as computed in Eq. 1). Conversely, if none of the five NatOps is predicted, we assign the independence operator (#).
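The selection logic (average per-prompt answer scores, keep positively answered operators, fall back to #) can be sketched as follows. This is a hand-rolled illustration with hypothetical inputs: the per-prompt Yes/No log-probabilities would come from the QA model, and all function names are ours.

```python
def natop_score(lp_yes, lp_no):
    """Average answer log-probabilities over an operator's question
    prompts (Eq. 1) and return (best_answer, its averaged score)."""
    yes = sum(lp_yes) / len(lp_yes)
    no = sum(lp_no) / len(lp_no)
    return ('Yes', yes) if yes >= no else ('No', no)

def assign_natop(prompt_scores):
    """prompt_scores maps a NatOp name to the per-prompt 'Yes'/'No'
    log-probability lists produced by the QA model for one aligned
    claim-evidence span.  Falls back to independence ('#') when no
    operator question is answered 'Yes'."""
    positive = {}
    for op, (lp_yes, lp_no) in prompt_scores.items():
        answer, score = natop_score(lp_yes, lp_no)
        if answer == 'Yes':
            positive[op] = score
    if not positive:
        return '#'                       # independence operator
    return max(positive, key=positive.get)  # most probable NatOp
```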

Proof Selection
Since we expand the l initial non-overlapping claim chunks with multi-granular merging into q overlapping ones, we can construct a total of C(l) = Σ_{i=l−m}^{l−1} C(i) proofs, with C(i) being the number of proofs for i chunks, C(0) = 1, and m being the maximum merge length.4 To select the most appropriate proof, we compute a score for each one, defined as the sum of a NatOp probability score s_p (Eq. 2) and a NatOp verdict score s_v (Eq. 3), introduced later in this section. We select the proof with the highest score.
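The recurrence C(l) = Σ_{i=l−m}^{l−1} C(i) counts the ways to tile l chunks with merged spans of length at most m, a generalized Fibonacci sequence. A short sketch (function name is ours):

```python
def num_proofs(l, m=4):
    """Number of candidate proofs over l initial chunks when up to m
    consecutive chunks may be merged: C(0) = 1 and
    C(n) = sum of C(i) for i in [n - m, n - 1] (clamped at 0)."""
    C = [1] + [0] * l
    for n in range(1, l + 1):
        C[n] = sum(C[max(0, n - m):n])
    return C[l]
```

With m = 4 this gives 1, 2, 4, 8, 15, ... proofs for l = 1, 2, 3, 4, 5 chunks, so scoring every candidate proof stays cheap for typical claim lengths.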
Since the probability for each NatOp is computed independently, we define the score s_p as the average of the predicted NatOp probabilities:

s_p(P) = (1/n) Σ_{i=1}^{n} QA(â_{o_i} | c_i, e_i, T_{o_i}),   (2)

with n being the length of the proof P, and T_{o_i} being the questions for the NatOp o_i assigned to the i-th aligned span in the proof. Since no probability is explicitly assigned to the independence operator (#), as we have no question to directly capture it (cf. Section 3.2), we add the lowest scoring option to the average in these cases (i.e. 0.5, since the predictions are boolean).
The NatOp verdict score s_v considers the claim aligned with its evidence in its entirety as the most coarse-grained chunk, for which our question-answering system computes a score over a textual representation of the veracity labels y:

s_v(P) = QA(y | c, E, T_v), with y = DFA(O),   (3)

with T_v being the veracity question templates and O being the NatOp sequence in the proof P. The score s_v is thus the probability assigned to the veracity label associated with the state of the DFA in which the proof P terminates, i.e. DFA(O). In our example, the proof in which was incapable and of writing are treated as separate spans receives a low score, due to its two independence NatOps and its termination in the Not enough info (NEI) state, while the proof in which they are merged into a single span is selected due to the high answer confidence in its predicted NatOps.

Training
We fine-tune the word-level alignment system as well as the question-answering system of QA-NatVer. The alignment system is trained on the objectives defined in Dou and Neubig (2021) for parallel corpora, namely masked language modelling, translation language modelling, and their self-training objective. We consider the claim c and each gold evidence e ∈ E_G as a sentence pair. We create further pairs using gold proofs P_G by considering all possible substitutions of claim chunks with their respective evidence chunks. Our question-answering system is fine-tuned following the few-shot recipe of Liu et al. (2022).

We train all systems on 32 samples unless stated otherwise, randomly sampling from the aforementioned annotated instances. We evaluate our system on the development split of FEVER (Thorne et al., 2018), consisting of 19,998 claims, using retrieved evidence. We use the document retriever of Aly and Vlachos (2022) and the sentence reranker of Stammbach (2021) to select the top k = 5 evidence sentences E. To assess the robustness of QA-NatVer, we also evaluate the systems on Symmetric FEVER (Schuster et al., 2019), a binary classification dataset (Supports, Refutes) of 712 instances, built to expose models that learn artefacts and erroneous biases from FEVER.
Baselines As a state-of-the-art faithful inference system, ProoFVer (Krishna et al., 2022) is the main baseline we compare against; it is based on GENRE (Cao et al., 2021), an end-to-end entity linking model fine-tuned on BART (Lewis et al., 2020). We evaluate a version of ProoFVer that is trained on the same data as QA-NatVer to compare both systems' data efficiency. We refer to the version of ProoFVer trained on over 140,000 FEVER instances using the additional knowledge sources outlined in Sec. 1 as ProoFVer-full. Moreover, we evaluate in our few-shot setting LOREN (Chen et al., 2022a), which decomposes claims into phrases and predicts their veracity using a neural network regularized on the latently encoded phrase veracities by simple logical rules. Similarly to ProoFVer, LOREN was trained on the entirety of FEVER.
We further consider few-shot baselines that do not guarantee faithfulness or provide the same level of interpretability. These include a state-of-the-art few-shot learning method, T-Few (Liu et al., 2022), which uses two additional loss terms to improve few-shot fine-tuning. While Liu et al. (2022) is based on T0 (Sanh et al., 2022), we instead use BART0 (Lin et al., 2022), a BART model instruction-finetuned on the multi-task data mixture described in Sanh et al. (2022), to keep the baselines comparable. They observe comparable performance between BART0 and T0, which we confirmed in preliminary experiments. Finally, we also evaluate a fine-tuned DeBERTa-Large model (He et al., 2021), as well as a DeBERTa model additionally fine-tuned on SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), both common and very competitive claim verification baselines (Stammbach, 2021; DeHaven and Scott, 2023).
Experimental Setup We sample K training instances and do not use a validation set for hyperparameter tuning, following the real-world few-shot learning setting of Alex et al. (2021). We use Awesomealign (Dou and Neubig, 2021) as the word-level contextualized alignment system. To stay comparable with our ProoFVer and T-Few baselines, we also use BART0 as our instruction-tuned language model. Notably, natural language inference is not a task BART0 has seen during instruction fine-tuning. To take advantage of a more powerful QA system, we also evaluate QA-NatVer using Flan-T5 (Chung et al., 2022), a state-of-the-art instruction-tuned language model which has explicitly seen natural language inference among many more tasks. Results are averaged over five runs with standard deviation indicated, unless otherwise noted.

Training DeBERTa on NLI datasets improves robustness substantially, performing even better than ProoFVer-full. Using Flan-T5 as our instruction-tuned model instead, QA-NatVer surpasses all other models, improving scores by about 4.9 accuracy points to reach an accuracy of 85.8 ± 0.9.
Varying sample sizes.

Claim length. Real-world claims are typically longer than the example claim shown in Figure 1. MultiFC (Augenstein et al., 2019), a large-scale dataset of real-world claims, has an average claim length of 16.7 tokens, compared to 9.4 for FEVER. We therefore measure the performance of QA-NatVer as a function of a claim's minimum length, as shown in Figure 3. QA-NatVer shows only a very small performance decline for claims of up to a minimum of 18 tokens, indicating its robustness to longer claims, correctly predicting the veracity of claims such as "The abandonment of populated areas due to rising sea levels is caused by global warming".

Ablation We perform three ablation studies, reported in Table 5. First, we examine the performance of QA-NatVer without multi-granular chunking. We observe an average drop of 7.7 accuracy points, demonstrating that considering evidence spans at different levels of granularity improves performance. Second, we ablate our proof selection method by omitting the veracity score s_v, observing a drop of 3.7 accuracy points. Finally, we compare our question-answering approach for NatOp assignment to a model that is prompted to predict NatOps for claim-evidence spans as a multi-class problem without multi-granular chunking. We use ChatGPT (OpenAI, 2022) and Llama-2 (Touvron et al., 2023) as state-of-the-art few-shot language models and prompt them with in-context examples from our annotated proofs. All other parts of QA-NatVer are kept identical. We observe that the non-QA approach leads to predicting more independence operators (#), resulting in a drop of 11.9 accuracy points with ChatGPT and 16.6 with Llama2-13B. For details see Appendix D.
We further evaluate QA-NatVer on DanFEVER (Nørregaard and Derczynski, 2021), a three-way claim verification task for Danish, consisting of a total of 6,407 instances. To eliminate additional variability from a multilingual retrieval system, we use the gold evidence for evaluation, except for NEI-labeled instances, for which we retrieve evidence via BM25.
Baselines & Experimental Setup: We use our baselines with multilingual backbones, namely T-Few and ProoFVer with an mT0 backbone (Muennighoff et al., 2022), as well as a fine-tuned XLM-RoBERTa (Conneau et al., 2020) model. We additionally consider ProoFVer-full, translating the claim and evidence from Danish into English using the translation system of Tiedemann and Thottingal (2020). QA-NatVer also uses mT0 (Muennighoff et al., 2022), a multilingual T5-based model instruction-tuned on multilingual tasks. Similarly to BART0, mT0 has not seen natural language inference in its fine-tuning. We use the chunking system of Pauli et al. (2021) and keep all other components unchanged. The language of the natural logic questions and answers remains English for all experiments.
Results Results on DanFEVER are shown in Table 6. Our system achieves an accuracy of 61.0 and an F1 of 56.5, outperforming all baselines by 1.8 and 2.7 points, respectively. The ProoFVer baseline trained on the same data as our model achieves a score of 47.9. Notably, in this setting our approach even outperforms ProoFVer-full, for which the claims and evidence are translated from Danish into English. Running ProoFVer-full in this setting is computationally expensive due to the required translation, and it still achieves worse accuracy than QA-NatVer. The variability in this language-transfer setup is higher than for FEVER, particularly for T-Few, but remains low for QA-NatVer.

Correctness of Natural Logic Operators
Assessing the quality of generated proofs exclusively by the verdict they result in ignores that an incorrect proof might lead to the correct verdict. For instance, in Figure 4, ProoFVer fails to assign equivalence (≡) even to identical spans, such as Highway and Highway, yet it still produces the correct veracity label. To intrinsically evaluate the quality of proofs, human subjects (not the authors of this paper) annotated a total of 114 NatOp assignments, from 20 claims and their associated proofs from both ProoFVer and QA-NatVer, for correctness. Each NatOp assignment was annotated by 3 annotators, resulting in 342 data points. The claims are selected via stratified sampling, ensuring that each class is equally represented. We further ensure that both models predict the same verdict. All three subjects assigned the same correctness label to a NatOp in 84.8% of cases, indicating high inter-annotator agreement. QA-NatVer's NatOp assignments are correct in 87.8% of cases, while ProoFVer's are correct in only 63.4%, indicating that the quality of NatOp assignments by QA-NatVer is superior to those by ProoFVer.
Considering the very high label accuracy of ProoFVer (outperforming QA-NatVer by almost 10 accuracy points), these results are surprising. We hypothesise that ProoFVer may have learned "shortcuts" to arrive at the correct verdict due to noisy signals in the weakly supervised proof dataset it was trained on, which was constructed with dataset-specific heuristics to reach a size sufficient for training. To validate this, we inspect mutations where the claim and evidence spans are identical. These are trivial cases in which the model is expected to predict the equivalence operator. However, ProoFVer produces a wrong NatOp in about 16.3% of these cases, mostly the independence operator (13%), while our system always predicts equivalence (see App. C).

Plausibility
To assess the plausibility of the natural logic proofs predicted by QA-NatVer, we run a forward prediction experiment (Doshi-Velez and Kim, 2017). Human subjects are asked to predict the veracity label solely from the justification (i.e. the proof) generated by the model, and to specify on a five-level Likert scale, ranging from very plausible to not plausible, how plausible the justification appears to them. Since we are evaluating proofs as an explanatory mechanism for humans, we ensured that no subject was familiar with the deterministic nature of natural logic inference. To enable non-experts to make use of the proofs, we replaced the NatOps with English phrases, similar to Krishna et al. (2022). The evaluation consists of 120 annotations from 6 subjects. The same 20 claims used in the human evaluation of correctness are paired with a ProoFVer or QA-NatVer proof explanation and are annotated by three subjects each. No subject annotates the same claim for both models, as otherwise a subject might be influenced by the explanation seen before for the same claim. Using the QA-NatVer proofs, subjects correctly predict the model's decision in 90% of cases, compared to ProoFVer's 76.9%. All three subjects selected the same verdict in 70% and 91.7% of cases for ProoFVer and QA-NatVer, respectively, with inter-annotator agreements of 0.60 and 0.87 Fleiss κ (Fleiss, 1971). Regarding the plausibility assessments, subjects rate QA-NatVer proofs an average of 4.61 out of 5 points, while ProoFVer is rated 4.16.

Efficiency
QA-NatVer remains computationally efficient since the typical bottleneck of transformer models, the input and output length, remains short at every stage of the pipeline. Concretely, the alignment module encodes each evidence sentence with the claim independently. The QA model takes as input a question with a single embedded claim span and its evidence, with the output being either Yes/No or a short phrase. The average input length to the QA model on FEVER is 20 tokens, while its output is in most cases a single token. This is substantially cheaper computationally than cross-encoding the claim and all evidence sentences and autoregressively generating the proof at once, as done by ProoFVer, with 195.2 input and 31.1 output tokens on average. The runtime of the QA module can be described as O(l · n²_span + n²_all), with l being the number of spans, n_span the input length for the aligned claim-evidence spans (for the NatOp probability score), and n_all the length of the claim and its evidence sentences (for the NatOp verdict score). We measure the wall-clock time (in minutes) with the BART-large backbone, using the same hardware configuration as described in Appendix B. Training T-Few, DeBERTa, ProoFVer, and LOREN takes 22.3, 21.4, 27.5, and 36.4 minutes, and inference on the 19,998 instances of the FEVER development set runs in 20.6, 7.3, 185.2, 116.5, and 89.1 minutes for QA-NatVer, T-Few, DeBERTa, ProoFVer, and LOREN, respectively.

Conclusion
This paper presented QA-NatVer, a natural logic inference system for few-shot fact verification that frames natural logic operator prediction as question answering. We show that our approach outperforms all baselines while remaining faithful. A human evaluation shows that QA-NatVer produces more plausible proofs with fewer erroneous natural logic operators than the state-of-the-art natural logic inference system, while being trained on less than a thousandth of the data, highlighting QA-NatVer's generalization ability. Future work will extend the capability of natural logic inference systems to more complex types of reasoning, including arithmetic computations.

Limitations
While natural logic provides strong explainability by operating directly on natural language, it is less expressive than alternative meaning representations that require semantic parsing, such as lambda calculus (Zettlemoyer and Collins, 2005). For instance, temporal expressions and numerical reasoning are beyond the expressive power of natural logic (Krishna et al., 2022), yet are frequently required when semi-structured information is available (Aly et al., 2021). Moreover, cases of ambiguity, such as cherry-picking, are difficult to process with natural logic. Addressing the limits of natural logic inference systems is out of scope for this paper. Similarly to ProoFVer, the proofs we construct are intended to be executed in the DFA from left to right; however, natural logic-based inference is not constrained to such an execution. Furthermore, all benchmarks explored in this paper use Wikipedia as the knowledge base, which is homogeneous compared to the heterogeneous sources professional fact-checkers use (e.g., news articles, scientific documents, images, videos).

Ethics Statement
Our system improves the explainability of claim verification models and empowers actors to make more informed decisions about whether to trust models and their judgements; yet actors must remain critical when evaluating the output of automated claim verification systems and not confuse explainability with correctness. We emphasize that we do not make any judgements about the truth of a statement in the real world, but only consider Wikipedia as the source of evidence. Wikipedia is a great collaborative resource, yet it has mistakes and noise of its own, similar to any encyclopedia or knowledge source. Thus, we discourage users from using our verification system to make absolute statements about the claims being verified, i.e. from using it to develop truth-tellers.
Listing 1: The instructive part of the prompt used in the ChatGPT and Llama2 ablation experiments. For each NatOp, we provided a short explanation and 5 examples (colored in blue), formatted as {evidence, claim, natop} triples.
The sentence transformer we use for alignment is sentence-transformers/all-mpnet-base-v2. In addition to the similarity score from the sentence transformer to select the most appropriate evidence span e_i from all options e_ij, we take into account the evidence retriever's ranking of the evidence sentences, down-weighting the similarity score by a factor that scales negatively with the rank, i.e. (j · 0.8).
Baselines We evaluate the T-Few baseline using the provided repository,⁶ which also provided the basis for the QA-NatVer code implementation. Liu et al. (2022) also propose a parameter-efficient fine-tuning method (named (IA)³) in addition to the loss functions; however, we observe much more stable training loss and better results in preliminary experiments when tuning the entire model, particularly for BART0. This degrading performance of BART0 with (IA)³ is also observed in Aly et al. (2023). Therefore, for all experiments and all models (baselines and QA-NatVer) we trained all model weights. Krishna et al. (2022) kindly provided us access to their ProoFVer model. For our ablation experiment with ChatGPT, we use OpenAI's API (Brockman et al., 2020) to query gpt-3.5-turbo-0613 and ask ChatGPT to assign NatOps to batches of claim-evidence pairs (at most 25 spans per query). For the ablation experiments with Llama2, we ran 13B-parameter models locally. We used the GPTQ (Frantar et al., 2022) versions of these models with 4-bit quantization to lower the computational requirements and speed up inference.

⁵ https://github.com/neulab/awesome-align
⁶ https://github.com/r-three/t-few
Hyperparameters Since no development data was used to tune hyperparameters, we set them to the default values described in Liu et al. (2022). We only reduced the learning rate for QA-NatVer, as we noticed that the training loss was otherwise unstable. Specifically, we set the hyperparameters to learning_rate: 1e-5, batch_size: 8, grad_accum_factor: 4, training_steps: 2000, and use the AdamW optimizer (Loshchilov and Hutter, 2019) with a linear decay with warmup scheduler. The same hyperparameters are used with Flan-T5.

C Human Evaluation
All subjects in the human evaluation are graduate or postgraduate students in either computer science or linguistics. Three subjects are male and three are female. None of the subjects had prior knowledge of natural logic inference. Table 7 shows the textual descriptions used for the NatOps in the human evaluation, and Table 8 shows the NatOps predicted by both ProoFVer and our system for aligned claim-evidence spans that are identical.

D Prompting
Listing 1 shows a template for the instructive part of our prompts with ChatGPT and Llama2.
For ChatGPT, we used OpenAI's API (Brockman et al., 2020) to query gpt-3.5-turbo-0613 and asked ChatGPT to assign NatOps to claim-evidence pairs. We used batches of at most 25 spans per query to lower the running costs. For the Llama2 experiments, we ran the models locally and asked them to assign one NatOp per query.
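The batching step amounts to simple list chunking; the batch size of 25 follows the text, while everything else in this sketch is illustrative:

```python
def chunk_spans(spans, batch_size=25):
    """Split aligned claim-evidence spans into query batches of at most
    `batch_size` items, mirroring the 25-spans-per-query limit above."""
    return [spans[i:i + batch_size] for i in range(0, len(spans), batch_size)]
```

Each resulting batch would then be formatted into a single prompt, reducing the number of API calls roughly by a factor of the batch size.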

Figure 1: At each inference step, a claim span is mutated into an evidence span via a natural logic operator (NatOp). The current veracity state and the mutation operator determine the transition to the next state via a deterministic finite automaton (DFA). Starting at S, the span "Anne Rice" is mutated via the equivalence operator (≡), remaining in S, according to the DFA. The inference ends in R, indicating the claim's refutation. We use question answering to predict the NatOps, taking advantage of the generalization capabilities of instruction-tuned language models.
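The left-to-right DFA execution in the figure can be sketched as a small transition table plus a runner. The states are S (supported), R (refuted), and N (not enough information); the entries below are an illustrative reconstruction for this example, not the paper's full transition table:

```python
# Illustrative subset of the NatOp transition table (our reconstruction).
TRANSITIONS = {
    ("S", "EQUIV"): "S",       # ≡ leaves the veracity state unchanged
    ("S", "FWD_ENTAIL"): "S",  # forward entailment preserves support
    ("S", "NEGATE"): "R",      # negation flips support to refutation
    ("S", "ALTERNATE"): "R",   # alternation also refutes
    ("R", "EQUIV"): "R",
}

def run_proof(natops, start="S"):
    """Execute a proof left to right; transitions missing from the table
    fall back to N, which we treat as absorbing."""
    state = start
    for op in natops:
        if state == "N":
            break
        state = TRANSITIONS.get((state, op), "N")
    return state
```

On the figure's example, an equivalence step keeps the proof in S, and a later negation ends it in R, refuting the claim.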
Liu et al. (2022), by optimizing the maximum likelihood estimation (cf. Eq. 1), complemented by an unlikelihood loss which discourages the model from predicting incorrect target sequences. The NatOps in the gold proofs P_G are used as positive QA samples, while we sample negative training instances (i.e., NatOp questions with the answer "No") from the gold proofs by randomly selecting a wrong NatOp for an aligned claim-evidence span.
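The shape of this objective can be illustrated with a scalar sketch. The actual losses in Liu et al. (2022) are computed over output token sequences; this simplification only shows how the likelihood and unlikelihood terms combine:

```python
import math

def qa_loss(p_yes_gold, p_yes_negative):
    """Scalar sketch of the training objective.

    p_yes_gold: model probability of answering "Yes" to the gold NatOp
    question (should be high); p_yes_negative: probability of answering
    "Yes" to a randomly sampled wrong NatOp (should be low).
    """
    likelihood = -math.log(p_yes_gold)            # reward the gold NatOp
    unlikelihood = -math.log(1.0 - p_yes_negative)  # penalise the wrong NatOp
    return likelihood + unlikelihood
```

The loss is zero when the model is fully confident in the gold NatOp and assigns zero probability to the sampled wrong one, and grows as either term degrades.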

Table 2: Results on FEVER with 32 claim annotations for training.

Table 3: Results on Symmetric-FEVER.

Robustness As shown in Table 3, QA-NatVer performs competitively against all baselines when run without adjustment on Symmetric-FEVER. QA-NatVer outperforms ProoFVer by 28.7 accuracy points and T-Few by 7.2 points. Our system is also competitive with models trained on the entirety of FEVER, including ProoFVer-full, performing only 1.2 accuracy points worse than ProoFVer-full (82.1).

Table 4 compares our approach against the baselines when trained with varying amounts of data in a few-shot setting: 16, 32, and 64 samples. QA-NatVer consistently outperforms our baselines across sample sizes. Notably, while DeBERTav3 sees improvements with 64 samples, ProoFVer's improvements are marginal, indicating a larger need for data. The variability decreases across all models with increasing sample size, with QA-NatVer having the lowest standard deviation, indicating its training robustness. QA-NatVer with Flan-T5 achieves a score of 71.6 ± 1.7 when trained with 64 samples. The question-answering formulation might also be beneficial for obtaining large-scale proof annotations (cf. Section 2) cheaply, reducing the workload for annotators. Investigating this option is left to future work.

Table 5: Ablation study of QA-NatVer on FEVER.

Claim: Highway to Heaven is something other than a drama. Evidence: Highway to Heaven is an American television drama series which ran on NBC from 1984 to 1989.

A FEVER example where ProoFVer and QA-NatVer reach the correct verdict (refutation). QA-NatVer produces more plausible proofs with fewer erroneous NatOp assignments than ProoFVer.
4.2 Application to Lower-Resource Language Data
To assess QA-NatVer's performance for languages with fewer resources than English, we evaluate it without further training (apart from the 32 annotated English FEVER claims) on DanFEVER.

Table 8: Predicted NatOps for claim and evidence spans that are identical. Despite the triviality of these cases, ProoFVer assigns the wrong operator in about 16.3% of them. In contrast, our system consistently predicts equivalence in every instance.