Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

Despite the much discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge. Trained on ~7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases, and with a combination of three training objectives, Vera is a versatile model that effectively separates correct from incorrect statements across diverse commonsense domains. When applied to solving commonsense problems in the verification format, Vera substantially outperforms existing models that can be repurposed for commonsense verification, and it further exhibits generalization capabilities to unseen tasks and provides well-calibrated outputs. We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.


Introduction
We introduce VERA, a general-purpose commonsense statement verification model. This model is designed to estimate the plausibility of declarative, natural language statements based on commonsense knowledge.
We build VERA in response to the absence of good detectors of commonsense errors in text generated by language models (LMs). LMs have been advancing rapidly and have demonstrated remarkable success in various tasks, including question answering, natural language inference, sequence classification, and text generation. Yet these models still make simple commonsense mistakes. As shown in Figure 1, as of February 23, 2023, ChatGPT (OpenAI, 2022a) reportedly output the text "since the density of a marble is much less than the density of mercury, the marble would sink to the bottom of the bowl if placed in it", which is clearly incorrect. This kind of failure raises concerns about the reliability and trustworthiness of these models (Lin et al., 2022).
VERA estimates a plausibility score for a commonsense statement based on its commonsense knowledge about the world. It contrasts with fact verification methods (Thorne et al., 2018; Wadden et al., 2020), which verify the correctness of claims based on evidence from a text corpus. VERA enables plausibility estimation in cases where direct evidence is often not retrievable from a corpus and where implicit, fuzzy reasoning is needed. It operates solely with the commonsense knowledge stored in its model parameters, and does not have a retrieval component.
VERA is built on top of T5 (Raffel et al., 2020), a generic pretrained LM, by finetuning on a vast collection of correct and incorrect commonsense statements sourced from knowledge bases (KBs) and question answering (QA) datasets. The 21 data sources (Table 5, appendix) amount to ∼7M statements encompassing a wide spectrum of domains, including general, scientific, physical, and social commonsense, as well as quantitative (reasoning about numbers) and qualitative (reasoning about qualitative relationships such as smaller) commonsense. We propose a novel two-stage training process that takes into account the scale and quality of data from different sources. In addition to the standard multiple-choice and binary classification objectives, we adopt a supervised contrastive loss (Khosla et al., 2020) to magnify the distinction between similar statements with different correctness labels. Furthermore, we propose an automatic way of augmenting the training data by eliciting LMs to generate incorrect answers to commonsense questions, and empirically find that it helps generalization.
We evaluate VERA in the following applications:

• Excelling in commonsense problems over GPT-series models when repurposed for verification (§5.1). VERA can be applied to solve multiple-choice and boolean commonsense problems when expressed in the verification format, by scoring and ranking candidate hypotheses. It substantially outperforms existing models repurposed for commonsense verification (including GPT-3.5, ChatGPT and GPT-4), improving upon the best existing baseline, Flan-T5, with an absolute improvement of 6% on seen benchmarks and 4% on unseen ones.

• Filtering LM-generated commonsense knowledge (§5.2). VERA can filter noisy commonsense knowledge statements generated by other LMs, improving the effectiveness of LM-generated knowledge in downstream knowledge-augmented inference. VERA is well-calibrated, enabling filtering at customized thresholds.

• Detecting commonsense errors in ChatGPT outputs (§5.3). Through a preliminary analysis, we find that VERA can identify commonsense errors made by ChatGPT in-the-wild, with a precision of 91% and a recall of 74%. An example of VERA in action is shown in Figure 1.
We hope VERA can be a useful tool for improving the commonsense correctness of existing generative LM output and inspire more effort toward general-purpose and robust verification methods.

Problem Definition and Scope
Our goal is to build a model that can estimate the plausibility of any given commonsense statement. The model takes as input a statement that (1) is expressed in natural language; (2) is declarative, as opposed to an interrogative question; (3) is self-contained, without requiring additional context to comprehend; (4) has an objective, binary correctness label; and (5) in principle can be labeled using widely-held commonsense knowledge about the world. Encyclopedic knowledge (e.g., Ljubljana is the capital of Slovenia.) is out of scope. Moving forward, unless explicitly noted, we use commonsense statement to refer to statements within the above scope. Though somewhat strict, this scope covers a broad range of potential applications.
For an input commonsense statement x, the model should output a real-valued score s ∈ [0, 1] that represents its estimated plausibility of x. While the gold correctness label is binary, we let the model output a score to reflect its confidence. A score of 1.0 means the model is completely confident that x is correct, and a score of 0.0 means it is completely confident that x is incorrect. When predicting a correctness label from the score, we use 0.5 as the threshold.

Method
In this section, we describe the whole pipeline for building VERA. We start by curating large-scale training data including both correct and incorrect statements from diverse commonsense tasks (§3.1). Next, we learn a scoring model that takes a statement and returns a continuous score by finetuning an LM via three training objectives (§3.2). An additional post hoc calibration strategy is applied to make the output scores well-calibrated (§3.3).

Data Construction
Labeled commonsense statements usually do not appear in text in the wild, but commonsense question answering (QA) datasets and commonsense knowledge bases (KBs) are good sources for this kind of statement. We collect correct and incorrect commonsense statements from these two types of data sources. Table 1 shows examples of how statements can be converted from QA problems and KB entries. In total, we obtain ∼7M statements (for training) from 19 QA datasets (§3.1.1) and two KBs (§3.1.2) that encompass a wide spectrum of commonsense domains. Table 5 (appendix) lists these datasets with statistics. All datasets we use are publicly available.

From Commonsense QA Datasets
Numerous commonsense reasoning datasets have been published in recent years (Davis, 2023); many of them are in the format of multiple-choice QA (selecting the correct answer out of a set of choices) or boolean (yes/no) QA. These can be easily converted to correct and incorrect commonsense statements. From multiple-choice QA problems, we combine the question and each answer choice to form declarative statements, which are correct when using the correct answer, and incorrect otherwise. From boolean QA problems, we convert the question into a declarative statement and keep the original label as the correctness label. Concrete examples can be found in Table 1.
Statement groups. We refer to statements originating from the same problem as a statement group. Note that statement groups originating from multiple-choice problems contain at least two statements, of which one and only one is correct; statement groups originating from boolean problems contain only one statement, and it can be either correct or incorrect. The conversion to declarative statements is done automatically, using the following method (a minimal code sketch follows this list):
• If the problem contains a question, we convert the question and choice into a declarative statement using the question conversion model created by Chen et al. (2021).
• If the question is cloze-style, we replace the blank with the choice.
• If the question is an incomplete sentence and the choice is a continuation to it, we concatenate the question and the choice.
• If there is no question and the problem only asks to choose between some choices, we use the choice as the declarative statement.
• For boolean problems, we always use yes as the choice and create a single declarative statement for each problem. We use the original label as the correctness label of this statement.
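As a concrete illustration of the conversion rules above, the following is a minimal sketch. The `question_to_statement` helper stands in for the question conversion model of Chen et al. (2021), and the field names are illustrative rather than the actual dataset schemas.

```python
# Minimal sketch of converting QA items into statement groups. The
# `question_to_statement(question, choice)` callable is a hypothetical wrapper
# around the question conversion model of Chen et al. (2021).

def convert_multiple_choice(question: str, choices: list[str], answer_idx: int,
                            question_to_statement) -> list[tuple[str, bool]]:
    """Turn one multiple-choice problem into a statement group."""
    group = []
    for i, choice in enumerate(choices):
        if question.endswith("?"):
            # Interrogative question: rewrite with the conversion model.
            statement = question_to_statement(question, choice)
        elif "_" in question:
            # Cloze-style question: fill the blank with the choice.
            statement = question.replace("_", choice)
        else:
            # Incomplete sentence: concatenate the question and the choice.
            statement = f"{question} {choice}"
        group.append((statement, i == answer_idx))
    return group

def convert_boolean(question: str, label: bool,
                    question_to_statement) -> list[tuple[str, bool]]:
    """Turn one boolean problem into a single-statement group."""
    statement = question_to_statement(question, "yes")
    return [(statement, label)]
```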
In total, 19 commonsense QA datasets contribute ∼200k statement groups and ∼400k statements to the training set of VERA.
LM-augmented falsehoods. Existing commonsense QA datasets are mostly manually constructed or assembled from standard school exams. A model trained on these datasets might overfit specific annotation patterns from humans, which may limit generalization. Therefore, we augment QA problems with LM-generated answers and construct additional incorrect statements. Specifically, for a multiple-choice question, we use a small LM to sample 50 possible answers to the question, and select the 3 least probable answers with generation probability less than 0.15 (making these unlikely to be correct answers). This threshold is chosen based on manual inspection over a small portion of examples: we observe that generated answers with probability larger than 0.15 are more likely to be plausible. We create LM-augmented falsehoods for the training sets of 9 commonsense QA datasets, as noted in Table 5 (appendix).

From Commonsense KBs
Commonsense KBs (e.g., Atomic2020 in Hwang et al. (2020)) are another source from which we construct correct and incorrect commonsense statements.

Model Architecture

Given an input statement x, the model computes a single vector representation h, taken from the hidden state of the final EOS token. We choose EOS because it is capable of encoding the entire input in both bidirectional encoder models (e.g., T5's encoder) and left-to-right decoder models (e.g., LLaMA). A linear layer then projects h to a scalar logit z, followed by a sigmoid function σ(·) that transforms the logit into a score s. Formally,

z(x) = w·h(x) + b,    s(x) = σ(z(x)),

where w and b are the weight and bias of the linear layer. For brevity, we use h(x), z(x) and s(x) to refer to the representation, logit and score of an arbitrary input x.
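The scoring head described above can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration, not the exact VERA implementation; the pretrained checkpoint name and the head initialization are placeholder choices.

```python
# Minimal sketch of the scoring head: take the hidden state of the final EOS
# token from a T5 encoder, project it to a scalar logit z with a linear layer,
# and squash to a score s with a sigmoid. Model name is illustrative, not the
# exact VERA configuration.
import torch
import torch.nn as nn
from transformers import T5EncoderModel, AutoTokenizer

class PlausibilityScorer(nn.Module):
    def __init__(self, model_name: str = "google/t5-v1_1-large"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.d_model, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Index of the last non-padding token (the EOS token) in each sequence.
        eos_idx = attention_mask.sum(dim=1) - 1
        h = hidden[torch.arange(hidden.size(0)), eos_idx]   # (B, d_model)
        z = self.head(h).squeeze(-1)                        # scalar logit per statement
        s = torch.sigmoid(z)                                # plausibility score
        return z, s

# Example usage (the T5 tokenizer appends the EOS token by default).
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-large")
scorer = PlausibilityScorer()
batch = tokenizer(["An average dog can follow an instruction manual."],
                  return_tensors="pt")
z, s = scorer(**batch)
```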

Batching
The data we construct consists of statements belonging to different statement groups. For reasons we will describe in §3.2.3, we put all statements belonging to the same statement group into the same batch. Each batch may contain multiple complete statement groups. We denote by B_G the number of statement groups and by B_S the total number of statements within a single batch. We denote the statement groups as X_1, …, X_{B_G}, and the statements as (x_1, y_1), …, (x_{B_S}, y_{B_S}), where y_i ∈ {0, 1} is the correctness label of x_i.
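A minimal sketch of this batching scheme is shown below; the variable names and the default of 64 groups per batch are illustrative.

```python
# Minimal sketch of the batching scheme: each batch holds B_G complete statement
# groups, so positives and negatives from the same problem always co-occur
# (needed for the multi-class and contrastive losses described next).
# `statement_groups` is a list of lists of (statement, label) pairs.
from typing import Iterator

def group_batches(statement_groups, groups_per_batch: int = 64) -> Iterator[list]:
    for start in range(0, len(statement_groups), groups_per_batch):
        batch_groups = statement_groups[start:start + groups_per_batch]
        # Flatten into statements while remembering group membership.
        batch = [(gid, stmt, label)
                 for gid, group in enumerate(batch_groups)
                 for stmt, label in group]
        yield batch
```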

Training Objectives
The model is trained with a linear combination of three losses: a binary classification loss, a multi-class loss, and a supervised contrastive loss,

L = α·L_bin + β·L_mc + γ·L_ctr,

which we describe below.
Binary classification loss. Naively, commonsense statement verification can be viewed as a binary classification task. Under this setting, the loss for a statement x_i with label y_i is

L_bin(x_i, y_i) = −y_i log s(x_i) − (1 − y_i) log(1 − s(x_i)).

Multi-class loss. We expect the model to be robust against nuances in commonsense statements. Ideally, the model should be able to recognize opposite correctness labels for a group of statements that appear similar in surface form, such as statements created from different choices of the same question, or perturbed from the same piece of knowledge in a KB. To achieve this goal, we treat each statement group as a multi-class classification problem, maximizing the log-likelihood of the single correct statement in the statement group after passing the logits through a softmax. Formally,

L_mc(X_j) = −log ( exp(z(x_j*)) / Σ_{x ∈ X_j} exp(z(x)) ),

where x_j* is the correct statement in X_j. Note that the multi-class loss is not applicable to statement groups with only one statement (i.e., statement groups from boolean QA datasets). We empirically find that the multi-class loss indeed improves generalization to unseen multiple-choice QA datasets, as indicated in Figure 3 (appendix).
Supervised contrastive loss. It has been shown (Khosla et al., 2020) that supervised contrastive learning helps improve model robustness and generalization against input variations. In light of this, we further adopt supervised contrastive learning on top of the input representations h. We show in Figure 3 (appendix) that the contrastive loss indeed improves generalization to unseen datasets. For each anchor statement x_i in a batch, the contrastive loss aims to maximize the similarity between x_i and each other statement x_p that has the same correctness label as x_i (i.e., a positive example). At the same time, we push apart x_i and other statements x_n that have the opposite correctness label (i.e., negative examples). Formally, denoting by P(i) the index set of statements that are positive examples for x_i and by N(i) the index set of statements that are negative examples for x_i, the supervised contrastive loss is

L_ctr(x_i) = −(1 / |P(i)|) Σ_{p ∈ P(i)} log ( exp(cos(h(x_i), h(x_p)) / τ) / Σ_{q ∈ P(i) ∪ N(i)} exp(cos(h(x_i), h(x_q)) / τ) ),

where τ is a temperature hyperparameter and cos(·, ·) refers to cosine similarity.
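The following is a minimal PyTorch sketch of the supervised contrastive term over the EOS representations h, following the formulation above; it is an illustration rather than the exact training code.

```python
# Minimal sketch of the supervised contrastive loss: cosine similarity with
# temperature tau, positives = other statements in the batch sharing the
# anchor's correctness label, denominator over all non-anchor statements.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(h: torch.Tensor, labels: torch.Tensor,
                                tau: float = 0.1) -> torch.Tensor:
    h = F.normalize(h, dim=-1)                      # dot product = cosine similarity
    sim = h @ h.t() / tau                           # (B, B) similarity matrix
    self_mask = torch.eye(len(h), dtype=torch.bool, device=h.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # log-probability of each candidate, excluding the anchor itself.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average over positives per anchor, then over anchors that have positives.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.sum(dim=1) > 0].mean()
```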

Two-Stage Training
Since data sourced from KBs are larger in scale but noisier than data sourced from QA datasets, we take a two-stage training approach. In training stage A, we start from a pre-trained LM and train with data sourced from KBs. In training stage B, we start from the model obtained in stage A and train with data sourced from QA datasets. During experiments we found that this setting is better than single-stage training with either data source or a mixture of the two.

Inference and Calibration
An ideal plausibility estimation model should be calibrated; that is, its confidence in its predictions should be approximately equal to the actual frequency of correctness. During early experiments, we found that VERA tends to be overconfident. Therefore, we apply a post hoc calibration on VERA's output. Following the temperature scaling method introduced in Guo et al. (2017), during inference we divide the model-predicted logit by a temperature T before computing the score, that is,

s(x) = σ(z(x) / T).

Note that no temperature scaling is applied during model training.
With predictions on a validation set D = {(x_i, y_i)}, we estimate the T that gives the minimal expected calibration error (ECE) (Naeini et al., 2015) on this validation set. Equation 1 in §C.1 shows how ECE is computed. In practice, we use the combined development sets of the seen datasets (§4.2) to estimate T, and the optimal T becomes a parameter of VERA. Note that temperature scaling does not change the relative ordering of prediction scores, and thus the other performance metrics (e.g., accuracy) are not affected (see detailed explanation in §B.2).
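A minimal sketch of the temperature search is shown below. The candidate grid is an illustrative choice, and the `expected_calibration_error` helper corresponds to the ECE definition in §C.1 (a sketch of which appears there).

```python
# Minimal sketch of post hoc temperature scaling: pick the temperature T that
# minimizes ECE on held-out validation logits, then divide logits by T at
# inference. The grid of candidate temperatures is illustrative.
import numpy as np

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray,
                    expected_calibration_error,
                    grid=np.linspace(0.5, 5.0, 91)) -> float:
    best_T, best_ece = 1.0, float("inf")
    for T in grid:
        scores = 1.0 / (1.0 + np.exp(-val_logits / T))   # sigmoid(z / T)
        ece = expected_calibration_error(scores, val_labels)
        if ece < best_ece:
            best_T, best_ece = T, ece
    return best_T
```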

Experimental Setup
In this section, we provide more details of model training, the evaluation protocol and metrics, and describe the baseline models we benchmark.

Training Details
Datasets. For training stage A, we use the ∼1.6M statement groups (∼6M statements) sourced from the two commonsense KBs; for training stage B, we use the ∼200k statement groups (∼400k statements) sourced from the 19 commonsense QA datasets. For each training stage, we mix the training sets of all datasets together, without any re-weighting.
Models. We use two types of pretrained LMs as the backbone of VERA: (1) the encoder of T5 (Raffel et al., 2020), which is a bidirectional encoder model; (2) LLaMA (Touvron et al., 2023), which is a left-to-right decoder model. For the T5 encoder, we start from the pretrained T5-v1.1-XXL, whose encoder has about 5B parameters, and refer to the resulting model as VERA-T5. (During experiments we found that starting from Flan-T5-XXL performs slightly worse than starting from T5-v1.1-XXL.) For LLaMA, we start from the pretrained LLaMA-7B and refer to the resulting model as VERA-LLaMA. As we will see, VERA-T5 has better performance than VERA-LLaMA, so unless explicitly specified, when we say VERA we mean VERA-T5. See Table 8 (appendix) for the complete hyperparameter settings and §C for the implementation details.

Evaluation and Baselines
Evaluation protocol. We divide our evaluation into two parts: (1) Seen benchmarks, whose training sets are used for model training. (2) Unseen benchmarks, whose training sets are not used for model training. We further divide the unseen benchmarks into type 1 and type 2: in type 1 benchmarks the task is similar to those in the seen benchmarks, while type 2 benchmarks are further away in terms of the nature of the task. Examples of type 2 unseen benchmarks include HellaSwag, which is contextualized with event descriptions, and CREAK, which involves reasoning among different entities. Depending on the nature of the evaluation benchmark, we use different metrics to evaluate our model's performance. Unless explicitly said otherwise, we report performance on the development set, where the gold labels are available, and we do not use the development sets of unseen datasets for model selection. The overall metric reported over multiple benchmarks is the unweighted average of the metric over all these benchmarks, so that differently-sized evaluation sets are weighted equally.

Evaluation Results
In this section, we evaluate the ability of VERA to estimate the plausibility of commonsense statements and compare it with the baseline models.
We show the effectiveness of VERA in three scenarios: solving commonsense problems, filtering LM-generated commonsense knowledge, and detecting commonsense errors in ChatGPT outputs.

Solving Multiple-Choice and Boolean Commonsense Problems
The output plausibility scores from VERA can be used for solving multiple-choice and boolean commonsense problems. We first convert the problems into the statement group format (§3.1). For multiple-choice problems, we choose the statement with the highest score in the statement group. For boolean problems, we use s = 0.5 as the threshold to predict correctness labels of statements. Table 2 reports the results when VERA is applied to solve commonsense problems. See Figure 5 and Tables 9, 10, and 11 (appendix) for full results including AUROC and AP. On seen benchmarks (16 multiple-choice and one boolean), VERA outperforms the best baseline, Flan-T5, by 6% absolute accuracy and 9% AUROC. VERA beats Flan-T5 by 4% accuracy and 5% AUROC on type 1 unseen benchmarks (four multiple-choice and one boolean), and by 4% accuracy and 6% AUROC on type 2 unseen benchmarks (five multiple-choice and two boolean), demonstrating good generalization. VERA-T5 has better performance than VERA-LLaMA across the board, which may be due to its bidirectional connectivity. Aside from performance, VERA also has good calibration, with ECE no higher than 3% on seen and unseen benchmarks. The post hoc calibration method improves calibration across all three parts.
Typically we would need to choose a threshold for binary classification in boolean datasets. However, we notice that a zero logit (z = 0) is generally close to the optimal decision threshold between correct and incorrect commonsense statements. Therefore we do not estimate a model-specific threshold, and simply use the default threshold: z = 0, or equivalently, s = 0.5.
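The inference procedure above amounts to an argmax within each statement group and a fixed threshold for boolean problems; a minimal sketch follows, where `score_statements` is a hypothetical wrapper that returns VERA scores for a list of statements.

```python
# Minimal sketch of using VERA scores to answer problems, assuming a hypothetical
# `score_statements(statements) -> list[float]` wrapper around the model.

def answer_multiple_choice(statement_group: list[str], score_statements) -> int:
    scores = score_statements(statement_group)
    # Index of the statement with the highest plausibility score.
    return max(range(len(scores)), key=scores.__getitem__)

def answer_boolean(statement: str, score_statements) -> bool:
    # Default threshold: s = 0.5 (equivalently, logit z = 0).
    return score_statements([statement])[0] >= 0.5
```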

Filtering LM-generated Commonsense Knowledge
Figure 2 reports the results when VERA is applied to filter LM-generated commonsense knowledge. On the two seen benchmarks, SKD_anno and I2D2_anno, VERA is a better knowledge filter than all baseline models, in terms of both AUROC and AP. In particular, on I2D2_anno it outperforms the I2D2 critic model by 2% AUROC, even though that critic is specifically trained on the I2D2_anno dataset; the I2D2 critic does not generalize well to other benchmarks. On the unseen benchmark, Rainier_anno, VERA is also comparable with the best baselines like Flan-T5 and GPT-3.5. As for calibration, the ECE is no higher than 8% on all three benchmarks. We find that filtering commonsense knowledge using VERA can greatly improve the performance of knowledge-augmented reasoning methods. In the Generated Knowledge Prompting framework (Liu et al., 2021), when solving a commonsense QA problem, a knowledge model first generates several commonsense knowledge statements relevant to the question, and then a QA model makes predictions based on them. A major problem that hinders the effectiveness of this framework is that model-generated knowledge is not always factual, and incorrect knowledge statements can mislead the QA model. We introduce VERA to filter these statements before passing them to the QA model: we keep those statements that receive a score higher than 0.5 from VERA.
Following Liu et al. (2022b), we use UnifiedQA-large as the QA model, and consider two knowledge models: few-shot GPT-3 (davinci) (Brown et al., 2020) and Rainier-large (Liu et al., 2022b). We follow the evaluation settings of Liu et al. (2022b), and for few-shot GPT-3 (davinci), we use the same task-specific few-shot prompts and the same process to generate silver knowledge as in Liu et al. (2022b). Results are shown in Table 3: applying knowledge filtering with VERA increases the effectiveness of knowledge-augmented inference, and the detailed results (Table 12, appendix) show that there is increased effectiveness on every individual benchmark.
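The filtering step itself reduces to a simple thresholding of VERA scores before the statements reach the QA model; a minimal sketch, again assuming a hypothetical `score_statements` wrapper:

```python
# Minimal sketch of filtering generated knowledge before it reaches the QA model.
def filter_knowledge(knowledge_statements: list[str], score_statements,
                     threshold: float = 0.5) -> list[str]:
    scores = score_statements(knowledge_statements)
    # Keep only statements that VERA scores above the threshold.
    return [k for k, s in zip(knowledge_statements, scores) if s > threshold]
```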

Preliminary Study on Detecting Commonsense Errors made by ChatGPT
VERA can be useful in detecting commonsense mistakes made by generative LMs in-the-wild. We collected 27 anecdotes from the Internet where people reported ChatGPT making commonsense errors, and manually rewrote them into their correct versions, obtaining 54 statements in total.
When detecting incorrect commonsense statements in this dataset, VERA has a precision of 91% and a recall of 74%, amounting to an F1 score of 82%. Table 4 shows how VERA scores some of these erroneous commonsense statements and their manually corrected versions. In 7 out of the 9 cases, VERA assigns a low score to the original, incorrect statement, and a high score to the corrected statement. For example, "since the density of a marble is much less than the density of mercury, the marble would sink to the bottom of the bowl if placed in it" receives a score of 0.04 and is identified as an incorrect statement, whereas "since the density of a marble is much less than the density of mercury, the marble would float if placed in mercury" receives a score of 0.96 and is identified as a correct statement. Meanwhile, there are also some failure cases. VERA believes that "it is possible for a solar eclipse to be followed by a lunar eclipse the next day", and fails to reject that "it is possible to draw a diagonal line in a triangle".

Analysis
Ablations. We conduct an ablation study by incrementally removing the following components from the training process: the contrastive loss (§3.2.3), training stage A (§3.2.4), LM-augmented falsehoods (§3.1), and the multi-class loss or binary loss (§3.2.3). Since at least one of the multi-class loss and the binary loss is needed, we remove them separately and observe the effect of training with a single loss.
Results are shown in Figure 3. Overall, the ablated components have more impact on unseen benchmarks than seen ones. Removing the contrastive loss hurts performance mostly on unseen datasets, implying that the contrastive objective is beneficial for generalization. Removing training stage A hurts performance across the board, emphasizing the importance of training with large-scale commonsense knowledge. LM-augmented falsehoods are most helpful on unseen benchmarks, with a small sacrifice in performance on seen benchmarks. The multi-class loss is most helpful on multiple-choice benchmarks, while removing the binary loss substantially hurts performance on boolean benchmarks.
Scaling Trends of VERA. We trained variants of VERA that are based on smaller versions of the T5 encoder, and show the results in Figure 4. Model performance increases steadily with size, and does not show evidence of saturation at 5B parameters, suggesting that better commonsense plausibility estimation models might be yielded from larger pretrained LMs.

Format: Verification vs. QA. In this paper, we focus on the verification format to solve commonsense problems. A comprehensive discussion on how this format compares with the QA format is provided in §E and Figure 7.

Related Work
Commonsense verifiers. Prior work has explored the idea of verifying commonsense statements. SYMBOLIC KNOWLEDGE DISTILLATION (West et al., 2021) and I2D2 (Bhagavatula et al., 2022) train models to classify the acceptability of model-generated commonsense statements. The ENTAILER (Tafjord et al., 2022) model is partially trained to score the validity of a given hypothesis. These models are trained on relatively small-scale, domain-specific data and do not generalize well to broader commonsense domains. Some other work uses pretrained LMs with few-shot prompting to verify commonsense statements (Kadavath et al., 2022; Jung et al., 2022). In this work, we develop a general-purpose commonsense statement verifier that works out-of-the-box in a zero-shot setting.
Verification in other tasks. Beyond commonsense statements, the problem of verification has been extensively studied for various NLP tasks. NLI (Liu et al., 2019, 2022a; Zhang et al., 2017) can be viewed as an entailment verification task. Chen et al. (2021) present a method for QA verification by transforming the context passage and question-answer pair into a premise-hypothesis format as in NLI. Some work builds models to perform reasoning verification, that is, classifying whether a premise supports or refutes a hypothesis (Bostrom et al., 2022; Sprague et al., 2022; Yang et al., 2022; Tafjord et al., 2022). On the other hand, fact verification (Thorne et al., 2018; Wadden et al., 2020) requires judging the validity of claims against a corpus of evidence (e.g., Wikipedia). These tasks feature context-sensitive or knowledge-intensive hypotheses to verify and are typically complemented with additional context. In contrast, we focus on verifying standalone commonsense statements where no context is required or provided.
Generation vs. verification. With the rapid progress in generative LMs, researchers have been largely building general-purpose problem-solving methods with a generative approach (Khashabi et al., 2020, 2022; Lourie et al., 2021; Tafjord and Clark, 2021; Wei et al., 2022). However, current generative LMs are still prone to hallucination errors and lack an intrinsic mechanism for expressing confidence in their outputs. Verification, on the other hand, shows promise to complement these shortcomings and has been adopted to improve the outcome of generation (Chen et al., 2021; Jiang et al., 2022). In this work, we take a pure verification approach and build a general-purpose verifier for commonsense statements, which to the best of our knowledge is the first of its kind.

Conclusion and Future Work
We introduced VERA, a general-purpose verification model for commonsense statements and an early step toward tools for mitigating commonsense errors in text generated by language models. VERA achieves state-of-the-art performance when solving commonsense problems in the verification format, excels at filtering LM-generated commonsense knowledge statements, and is useful in detecting erroneous commonsense statements from generative LMs. Furthermore, the scores produced by VERA are well-calibrated and can be used as plausibility estimates for declarative statements where needed. As VERA mainly targets single-sentence statements, future work may consider verification of multi-sentence or long-form statements, or contextualized/defeasible commonsense statements.

Limitations
VERA aims, and is trained, to predict the plausibility of statements based on objective commonsense knowledge of our world. It is not intended to handle text outside the scope of commonsense statements (e.g., encyclopedic facts, reading comprehension with fictional worlds). It is not trained or evaluated on moral commonsense data, so its capability of making moral predictions is unknown. It gives a prediction even if the input falls outside its intended scope, which could be mitigated by an additional scope guard that determines its applicability. In addition, it is not trained to handle very long and compositional input. Although greatly outperforming existing systems, VERA is not perfect and may make incorrect predictions. It is not very robust under syntactic variations of the input, such as paraphrases and negations. As the training data may contain bias or toxicity, VERA may also make predictions that are perceived as ethically problematic.
The output of VERA does not reflect the authors' view.VERA is a research prototype, and it is not designed for making real-world decisions.

A More Details on Datasets
Table 6 shows more dataset statistics, and Table 7 shows the dataset citations and links from which we retrieved the datasets.

A.1 Dataset-Specific Special Handling
We pre-process some datasets into a unified multiple-choice or boolean format. We provide the details below.
Com2Sense (paired). Com2Sense contains true and false statements that can be paired into complements. To utilize this pairing information, we place the two statements in each pair into the same statement group, and treat this as a multiple-choice dataset. Some statements in the dev set are not paired, so we discarded these examples.
CycIC (mc). CycIC contains both multiple-choice and boolean QA problems. To keep consistency in evaluation, we use only the multiple-choice problems, which is the dominant problem type in this dataset.
ComVE (task A). ComVE contains data for three tasks. Task A is assigning true/false labels to paired statements, similar to Com2Sense (paired). Tasks B and C involve choosing and generating explanations for a given statement that is against commonsense. We use the data for task A.

SKD (annotated).
The annotated dataset of Symbolic Knowledge Distillation (SKD) contains LM-generated, semi-structured knowledge triples, where the head and tail events are connected by relations, such as (PersonX doesn't like to wait, xIntent, to get the job done).
Following West et al. (2021), we replace the name placeholders with random person names, and convert the triples into natural language statements using templates adapted from Hwang et al. (2020). For example, the triple in the above example becomes "Arnold doesn't like to wait. Because Arnold wanted to get the job done."
We set the correctness label to be true iff the valid field has a positive value.
I2D2 (annotated). The annotated dataset of I2D2 contains LM-generated commonsense statements with human-annotated correctness labels. We use the combination of annotated data from "Iter0" and "Iter2", because the data of "Iter1" is missing from the website.

A.2 Conversion to Declarative Statements
From QA datasets, we create declarative statements from QA problems using the following method:
• If the problem contains a question, we convert the question and choice into a declarative statement using the question conversion model created by Chen et al. (2021).
• If the question is cloze-style, we replace the blank with the choice.
• If the question is an incomplete sentence and the choice is a continuation to it, we concatenate the question and the choice.
• If there is no question and the problem only asks to choose between some choices, we use the choice as the declarative statement.
• For boolean problems, we always use yes as the choice and create a single declarative statement for each problem. We use the original label as the correctness label of this statement.

B More Details on Method B.1 Training Objectives
Binary classification loss. We defined the binary classification loss for a single statement as

L_bin(x_i, y_i) = −y_i log s(x_i) − (1 − y_i) log(1 − s(x_i)).

To account for the fact that there are usually more incorrect statements than correct ones in the data produced from multiple-choice datasets, we divide this loss by the number of statements with the same correctness label in the same statement group. Therefore, the binary classification loss for the whole batch is

L_bin = (1 / B_G) Σ_j Σ_c L_bin(x_jc, y_jc) / Σ_{c′} I[y_jc′ = y_jc],

where j ranges over the B_G statement groups in the batch, c and c′ range over the C_j statements in statement group X_j, x_jc is the c-th statement in X_j, and I is the indicator function.

Multi-class loss. We defined the multi-class loss for a statement group X_j as

L_mc(X_j) = −log ( exp(z(x_j*)) / Σ_{x ∈ X_j} exp(z(x)) ),

where x_j* is the correct statement in X_j. The multi-class loss for the whole batch is

L_mc = (1 / B_G) Σ_j L_mc(X_j).

Supervised contrastive loss. We defined the supervised contrastive loss for an anchor statement x_i as in §3.2.3. The supervised contrastive loss for the whole batch is

L_ctr = (1 / B_S) Σ_i L_ctr(x_i, y_i).
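A minimal PyTorch sketch of the batch-level binary and multi-class losses as reconstructed above is given below; tensor names and the skipping of single-statement groups follow the description in this section, but the snippet is an illustration rather than the exact training code.

```python
# Minimal sketch of the batch-level binary and multi-class losses. `logits` and
# `labels` are flat tensors over all statements in the batch, and `group_ids`
# maps each statement to its statement group.
import torch
import torch.nn.functional as F

def binary_loss(logits, labels, group_ids):
    per_stmt = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    weights = torch.zeros_like(per_stmt)
    for g in group_ids.unique():
        for y in (0, 1):
            mask = (group_ids == g) & (labels == y)
            if mask.any():
                # Divide by the number of same-label statements in the same group.
                weights[mask] = 1.0 / mask.sum()
    return (per_stmt * weights).sum() / group_ids.unique().numel()

def multi_class_loss(logits, labels, group_ids):
    losses = []
    for g in group_ids.unique():
        mask = group_ids == g
        if mask.sum() < 2 or labels[mask].sum() != 1:
            continue  # skip single-statement (boolean) groups
        group_logits = logits[mask]
        target = labels[mask].float().argmax()     # index of the correct statement
        losses.append(F.cross_entropy(group_logits.unsqueeze(0), target.unsqueeze(0)))
    return torch.stack(losses).mean() if losses else logits.new_zeros(())
```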

B.2 Calibration
Our calibration is a post hoc strategy and does not affect the task performance metrics we report in §5. This is because applying our calibration method, temperature scaling, does not affect the relative order of plausibility scores assigned to a given set of statements:
• For tasks with multiple-choice questions (§5.1), calibration does not affect the argmax prediction, for the above reason.
• For commonsense knowledge filtering (§5.2), calibration does not affect the TPR/FPR numbers at each corresponding decision point, again for the above reason, so the ROC curves are valid.
• For True/False judgment problems (§5.1 and §5.3), calibration does not move the plausibility scores across the decision boundary. We use logit z = 0.0 (or equivalently, plausibility score s = 0.5) as the True/False boundary. A positive (or negative) logit remains positive (or negative) after applying the temperature.

C More Details on Experimental Setup
Table 8 shows the hyperparameter settings for training VERA. These values were obtained from moderate hyperparameter tuning; we did not do an extensive search due to training cost. For tokenization, the T5 tokenizer tokenizes input so that it ends with the EOS token </s> (token ID = 1). We manually configured the LLaMA tokenizer so that its output ends with the EOS token </s> (token ID = 2), and does not contain the BOS token <s> (token ID = 1). Models are trained for S = 50k steps with B_G = 64 statement groups per batch, using the Adam optimizer (Kingma and Ba, 2014) with learning rate η = 1 × 10⁻⁵ for the T5 encoder and η = 2 × 10⁻⁶ for LLaMA. We train models with the Huggingface Transformers and Accelerate libraries (Wolf et al., 2019; Gugger et al., 2022). For memory efficiency, during training each statement is truncated to 128 tokens (which can accommodate more than 99% of the statements; see Table 6) and each statement group is capped at four statements.

C.1 Definition of Metrics
Multiple-choice accuracy. For multiple-choice benchmarks, we report the multiple-choice accuracy: the fraction of problems for which the correct statement in the statement group receives the highest plausibility score.

Boolean accuracy. The boolean accuracy is the fraction of statements whose predicted correctness label (i.e., whether s > 0.5) matches the gold label. Boolean accuracy is applicable to balanced boolean benchmarks where there are roughly equal numbers of true and false statements (e.g., CommonsenseQA 2.0, Spatial Commonsense, StrategyQA, CREAK). Generally it is not a good metric for multiple-choice benchmarks and unbalanced boolean benchmarks.
AUROC and AP. For unbalanced boolean benchmarks (e.g., LM-generated knowledge filtering datasets), accuracy may not faithfully capture the model's performance. Instead, the metrics we use are the area under the ROC curve (AUROC) and the average precision (AP) for selecting the True statements. Statements are ranked based on their assigned raw scores, so that different score thresholds can be selected to construct the ROC and precision-recall curves. Aside from unbalanced boolean benchmarks, AUROC and AP are also applicable to multiple-choice and balanced boolean benchmarks.
Calibration. To measure how well the verifier-predicted score reflects its confidence, we measure the ECE (Naeini et al., 2015) on the boolean benchmarks. ECE is computed as

ECE = Σ_{m=1}^{M} (|B_m| / |D|) · |acc(B_m) − conf(B_m)|,

where M is the number of bins which bucket data points with similar predictions, B_m ⊆ D is the subset of data points that fall into the m-th bin, acc(B_m) is the fraction of correct statements in the bin, and conf(B_m) is the mean predicted score in the bin. We use M = 10 equal-sized bins when computing ECE.
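A minimal sketch of this ECE computation is shown below, interpreting "equal-sized bins" as bins holding an equal number of examples (equal-width binning is the other common reading).

```python
# Minimal sketch of expected calibration error (ECE) with M bins, assuming
# bins that each hold an equal number of examples.
import numpy as np

def expected_calibration_error(scores: np.ndarray, labels: np.ndarray,
                               num_bins: int = 10) -> float:
    order = np.argsort(scores)
    scores, labels = scores[order], labels[order]
    bins = np.array_split(np.arange(len(scores)), num_bins)
    ece = 0.0
    for idx in bins:
        if len(idx) == 0:
            continue
        avg_conf = scores[idx].mean()   # mean predicted plausibility in the bin
        avg_acc = labels[idx].mean()    # empirical fraction of correct statements
        ece += len(idx) / len(scores) * abs(avg_conf - avg_acc)
    return float(ece)
```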
C.2 Details on Baseline Models

SKD Critic. West et al. (2021) trained a critic model that filters incorrect commonsense knowledge generated by their symbolic knowledge distillation (SKD) method. This critic model is based on RoBERTa-large (Liu et al., 2019) and is finetuned on 8k GPT-3-generated commonsense knowledge sentences with human-annotated true/false labels.
The model predicts a [0, 1] score s, which we use as the final score, and we let the logit z = σ⁻¹(s).
I2D2 Critic. Bhagavatula et al. (2022) trained a critic model that filters incorrect commonsense knowledge generated by their I2D2 method. This critic model is based on RoBERTa-large (Liu et al., 2019) and is finetuned on 12k I2D2-generated commonsense knowledge sentences with human-annotated true/false labels. Given an input statement, the model predicts two logits: t for the True label and f for the False label. We let the logit z = t − f and the score s = σ(t − f). We use the critic model trained in the final iteration (i.e., "Iter 2" in I2D2; https://gengen.apps.allenai.org/).
UnifiedQA-v2. UnifiedQA-v2 (Khashabi et al., 2022) is a general-purpose QA model trained on datasets with a variety of input formats, including boolean datasets. When the input is a declarative statement, the model is trained to output either "yes" or "no". We use this feature of the model and make it act as a commonsense statement verifier. For an input statement, we compute the logits received by "yes" and "no" in the decoder, denoted as t and f, respectively. We let the logit z = t − f and the score s = σ(t − f). We use the largest version of this model, UnifiedQA-v2-11b.

Entailer. Entailer (Tafjord et al., 2022) is a model trained to construct proof trees for scientific commonsense hypotheses. This multi-angle model can be used in three ways: (1) given a hypothesis, generate a set of premises that may entail it; (2) given a hypothesis, predict a score that reflects the model's belief in it; (3) given a hypothesis and a set of premises, predict a score that reflects whether there is a valid entailment between them. We use (2) as a commonsense statement verifier. The model predicts a [0, 1] score s, which we use as the final score, and we let the logit z = σ⁻¹(s). We use the largest version of this model, Entailer-11b.

GPT-3.5. GPT-3.5 (OpenAI, 2022b) is a series of general-purpose autoregressive decoder-only LMs.
To make it act as a commonsense verifier, we use the following input prompt:

Question: Based on commonsense knowledge, is the following statement correct? Please answer yes or no.

Statement: {statement}
Answer:

We query the OpenAI Completions API with this prompt and compute the logits received by " Yes" and " No" in the next-token prediction, denoted as t and f, respectively. We let the logit z = t − f and the score s = σ(t − f). We experimented with several prompt formats and found the one presented above to have the best performance; in most cases, " Yes" and " No" together receive most of the probability mass during next-token prediction.
We also experimented with several models in the GPT-3 (Brown et al., 2020) and GPT-3.5 series, and found GPT-3.5 (text-davinci-002) to work the best.
Additionally, we report a baseline where the (negated) language modeling perplexity is used for commonsense plausibility.Note that the plausibility scores derived this way are not normalized, and we only use them for ranking purposes.For this baseline, we use GPT-3.5 (text-davinci-002) as the base model, and name it as "PPL (GPT-3.5)".
ChatGPT and GPT-4. ChatGPT (OpenAI, 2022a) and GPT-4 (OpenAI, 2023) are LMs optimized for chat. To make them act as commonsense verifiers, we use the same input prompt as for GPT-3.5, without the "Answer:" line. We query the OpenAI Chat API with this prompt in a user message, and obtain the first token of the assistant message in the response. Besides this zero-shot setting, we additionally report a few-shot chain-of-thought (Wei et al., 2022) setting with 5 in-domain examples, formatted as additional user-assistant message pairs prior to the query user message.
Since the API does not provide token logits, we let the score s = 1.0 when this token is "Yes", and s = 0.0 when this token is "No". In the unlikely case that this token is neither, we let s = 0.5. We add a small random noise to the score; this is to arbitrate potentially multiple positive predictions within statement groups from multiple-choice QA problems, and to enable plotting the ROC and precision-recall curves. Note that this is not an ideal solution and may cause under-estimation of ChatGPT and GPT-4's performance.
Flan-T5. Flan-T5 (Chung et al., 2022) is a series of sequence-to-sequence LMs instruction-finetuned on a massive number of tasks. To make it act as a commonsense verifier, we use the same input prompt as for GPT-3.5. We compute the logits received by "yes" and "no" in the first token prediction in the decoder, denoted as t and f, respectively. We let the logit z = t − f and the score s = σ(t − f).
We experimented with several prompt formats and found the one presented above to have the best performance; in most cases, "yes" and "no" together receive most of the probability mass during the token prediction. We use the largest version of this model, Flan-T5-XXL. Note that some unseen benchmarks are in the training data of Flan-T5; see Table 7 for details on data contamination.
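A minimal sketch of this yes/no scoring scheme for Flan-T5 is shown below, using the Huggingface Transformers API; the prompt assembly and token lookups are illustrative and may differ from the exact evaluation code.

```python
# Minimal sketch of repurposing Flan-T5 as a verifier: compare the logits that
# "yes" and "no" receive at the first decoding step, then take z = t - f and
# s = sigmoid(z). Prompt and token handling are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")

def flan_t5_score(statement: str) -> float:
    prompt = ("Question: Based on commonsense knowledge, is the following statement "
              f"correct? Please answer yes or no.\n\nStatement: {statement}\n\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    t = logits[tokenizer("yes", add_special_tokens=False).input_ids[0]]
    f = logits[tokenizer("no", add_special_tokens=False).input_ids[0]]
    return torch.sigmoid(t - f).item()
```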

D More Evaluation Results
Figure 5 is an expansion of Table 2 and additionally shows the precision-recall curves on problem-solving benchmarks. Table 9, Table 10, and Table 11 show the per-dataset breakdown of the accuracy numbers in Figure 5. Figure 6 is an expansion of Figure 2 and additionally shows the precision-recall curves on knowledge-filtering benchmarks.
Table 12 shows the per-dataset breakdown of the accuracy numbers in Table 3.

E Further Analysis
Format: Verification vs. QA. In this paper, we have been using the verification format to approach problem-solving tasks. But do we lose something when compared to using the QA format? See Figure 7 for a comparison.

Figure 1 :
Figure 1: VERA estimates the correctness of declarative statements. Example adapted from a contribution made by Henry Minsky to Marcus and Davis (2023) on February 23, 2023.

Figure 2 :
Figure 2: Results for filtering LM-generated commonsense knowledge with VERA. We plot the calibration curve for both the uncalibrated version (w/ faded color) and calibrated version (w/ saturated color) of the VERA model. Results on the development sets are reported. See Figure 6 for full results.
Table 4 :
Examples of commonsense mistakes made by ChatGPT, and how VERA can detect them. In each section, the first line is the original, incorrect commonsense statement in ChatGPT's output, and the second line is the authors' manually corrected version of the statement. Each statement is followed by VERA's score and predicted correctness label. Examples are adapted from Venuto (2023); Marcus and Davis (2023); Borji (2023).

Figure 3 :
Figure 3: Ablation results. Average accuracy on the development sets is reported. Components are incrementally removed from the training process, except for the multi-class loss and the binary loss; the hierarchy is indicated in the legend.

Figure 5 :
Figure 5: Results on problem-solving with VERA on seen and unseen benchmarks. Average results on the development sets are reported. Accuracy across different parts (seen, unseen (type 1), unseen (type 2)) is not directly comparable due to different underlying benchmarks. For calibration curves, curves with saturated colors are results after applying post hoc calibration (§3.3), while curves with faded colors are results from the raw logits.

Figure 6 :
Figure 6: Results for filtering LM-generated commonsense knowledge with VERA. Results on the development sets are reported.

Figure 7 :
Figure 7: Comparing verification and QA, the two different formats for problem-solving tasks. Average accuracy on the development sets of the seen multiple-choice benchmarks is reported. We use text-davinci-002 as GPT-3.5 here, and gpt-3.5-turbo-0301 as ChatGPT. VERA in QA format actually means a T5 model finetuned on the same seen multiple-choice data as VERA.

Table 1 :
Conversions from original commonsense QA problems and knowledge base entries to statement groups that are used for training.


Table 3 :
Results of introducing VERA into the Generated Knowledge Prompting pipeline (Liu et al., 2021). The QA model is UnifiedQA-large, and the generator is either GPT-3 (davinci) or Rainier-large when applicable. Average accuracy on the development set is reported; see Table 12 (appendix) for detailed results.

Table 5 :
Datasets and statistics. Data sourced from commonsense KBs are listed under STAGE A TRAINING, and data sourced from commonsense QA datasets are listed under STAGE B TRAINING. The number in parentheses under the Format column represents the number of choices per question. The Aug column indicates whether LM-augmented falsehoods are generated for each dataset. The last three columns are the number of total, correct and incorrect statements in the development set. See Table 6 for more dataset statistics, and Table 7 for full citations and sources for these datasets.

Table 6 :
More dataset statistics. This table shows the percentiles of statement lengths (as in number of T5 tokens) in each dataset.

Table 7 :
More dataset details. We show the link from which we retrieved each dataset, and whether each dataset is included in the training data of Flan-T5.