LM vs LM: Detecting Factual Errors via Cross Examination

A prominent weakness of modern language models (LMs) is their tendency to generate factually incorrect text, which hinders their usability. A natural question is whether such factual errors can be detected automatically. Inspired by truth-seeking mechanisms in law, we propose a factuality evaluation framework for LMs that is based on cross-examination. Our key idea is that an incorrect claim is likely to result in inconsistency with other claims that the model generates. To discover such inconsistencies, we facilitate a multi-turn interaction between the LM that generated the claim and another LM (acting as an examiner) which introduces questions to discover inconsistencies. We empirically evaluate our method on factual claims made by multiple recent LMs on four benchmarks, finding that it outperforms existing methods and baselines, often by a large gap. Our results demonstrate the potential of using interacting LMs for capturing factual errors.

In this work, we take a different approach to this problem, motivated by truth-seeking mechanisms in law. Specifically, we consider the setting where a witness is cross-examined in order to check whether their statement is factually correct or not. In such a setting, the examiner asks questions that aim to lead the witness towards contradictory statements, and a contradiction implies that the witness lied in at least some of the statements. Hence the well-known quote "Were you lying then or are you lying now?" (Wilder et al., 1957).
To apply this mechanism to LM factuality evaluation, we propose the following setting, illustrated in Figure 1. Our goal is to check whether a statement made by an LM ("Augustus was the first Roman Emperor to sport a beard") is factually correct. We refer to the model that generated this statement as the EXAMINEE. To check whether this fact is correct, we use another LM, called EXAMINER, to conduct a cross-examination of EXAMINEE. Concretely, we craft designated prompts to facilitate a multi-turn interaction between the two LMs, where EXAMINER issues questions (e.g., "Was it common for Roman Emperors to wear beards?") to EXAMINEE to check the veracity of the original statement. The examination is concluded by a decision from EXAMINER as to whether the original claim was correct or not.

Our problem setting is related to that of calibration (Guo et al., 2017), where the goal is to predict the probability that a model will err. However, unlike previous approaches to this problem, we use text generated by LMs. Our approach is motivated by the intuition that calibration is actually an elaborate reasoning process in which one checks the level of support a fact has, based on other statements the model believes. We argue that such complex reasoning is naturally performed via the strong conversational skills of modern LMs.
We use our method to detect errors in LM generation in the context of factual question-answering.

EXAMINEE claim: "Augustus was the first Roman Emperor to sport a beard."
EXAMINER questions: 1. Who was Augustus? 2. What is the definition of a Roman Emperor? 3. Was it common for Roman Emperors to wear beards? 4. Who were the previous Roman Emperors before Augustus?
EXAMINEE answers: 1. Augustus was the first Roman Emperor. 2. A Roman Emperor was a political leader who held supreme authority in the Roman Empire. 3. No, it wasn't common for Roman Emperors to wear beards. 4. The previous rulers before Augustus were known as the Roman Republic's leaders, such as Julius Caesar.
EXAMINER follow-up question: Are there any historical records or paintings that depict Augustus with a beard?
EXAMINEE answer: There are no known historical records or paintings that depict Augustus with a beard.
EXAMINER conclusion: The claim that Augustus was the first Roman Emperor to sport a beard is incorrect. In fact, it was not common for Roman Emperors to wear beards, and Augustus was known for promoting a clean-shaven look. Additionally, there are no known historical records or paintings that depict Augustus with a beard.

Figure 1: An example of our LMVLM approach. The first line shows the statement made by the EXAMINEE LM. Then an interaction between the EXAMINER and EXAMINEE takes place, and the EXAMINER arrives at a conclusion as to whether the original statement was correct or not (here it concludes that it was a false statement).
Our experiments with several recent LMs (CHATGPT, GPT-3, and LLAMA) show that cross-examination effectively detects factually incorrect claims generated by LMs. Specifically, across multiple datasets and examination settings, it detects over 70% of the incorrect claims while maintaining a high precision of over 80%, outperforming strong baselines by a large gap.
Further analysis shows that examiner LMs introduce multiple questions throughout the examination, and employ various strategies to reveal inconsistencies, including question paraphrasing, validation of implicated arguments, claim decomposition, and requests for evidence.
To conclude, our contributions are (a) framing the task of factuality testing as an interaction between two LMs, (b) proposing a concrete implementation of this interaction via the use of one LM with different prompts in a zero-shot setting, and (c) demonstrating improved factuality detection accuracy across several benchmarks.

LM Cross-Examination
Our goal is to employ an "examiner" LM (EXAMINER) to evaluate claims generated by another LM (EXAMINEE). To this end, we leverage the recent success of prompting (Liu et al., 2023b) to facilitate a cross-examination setting between the two LMs. In such a setting, EXAMINER should introduce questions with the objective of revealing inconsistencies with respect to an initial claim made by EXAMINEE. Such inconsistencies can be considered a signal of EXAMINEE's uncertainty in its original claim, and thus can be used to assess whether its original statement was correct.
Given an EXAMINER LM and a claim C generated by an EXAMINEE, our method establishes a multi-turn interaction between the two LMs, where at each turn one of the LMs is prompted with a designated prompt that incorporates the outputs from previous turns. This interaction continues until the examiner has no further questions and can provide its final decision. To establish a meaningful interaction that reveals possible inconsistencies, we define three stages for the examination, each guided by a specific prompt. As part of each prompt for EXAMINEE or EXAMINER, we provide the outputs generated in the previous rounds for context. We next describe the examination stages in detail; the overall process is illustrated in Figure 2.
Stage 1: Setup The examination begins by "assigning" EXAMINER its role: describing the task setting, providing it with EXAMINEE's claim, and asking it to generate questions for EXAMINEE. Next, we feed the questions generated by EXAMINER, one at a time, to EXAMINEE, concatenated to the following instruction: "Please answer the following questions regarding your claim." The response from EXAMINEE yields a set of answers to the questions from EXAMINER.
Stage 2: Follow-up Questions We next feed EXAMINER with the answers generated by EXAMINEE to its initial questions, and ask EXAMINER whether it has any follow-up questions. Notably, outputs from EXAMINER at this stage are conditioned on the previous outputs from EXAMINEE. If the answer from EXAMINER is "Yes", we further prompt it to obtain more questions. This phase is conducted iteratively, until either EXAMINER declares it has no follow-up questions or the number of turns reaches a threshold.

Stage 3: Factuality Decision Once no further questions are obtained from EXAMINER, we prompt it to conclude whether the claim C is true or false. Specifically, we request that it reply with either "correct" or "incorrect" as its final conclusion. In cases where the examiner does not output either "correct" or "incorrect", we consider its final decision to be a rejection of the claim. Typically, though, we observe that the examiner follows the instructions and indeed generates a definitive conclusion (see statistics in §5).
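For concreteness, the following Python sketch shows one way the three stages can be orchestrated. It is a minimal illustration rather than the implementation used in our experiments: the examiner and examinee callables, the prompt wording, and the cap on follow-up rounds are placeholders standing in for the designated prompts of Table 10.

```python
from typing import Callable

LM = Callable[[str], str]  # any function mapping a prompt to a model response

MAX_FOLLOWUP_TURNS = 5  # assumed threshold on the number of follow-up rounds


def cross_examine(examiner: LM, examinee: LM, claim: str) -> bool:
    """Return True if the examiner accepts the claim, False if it rejects it.

    Minimal sketch of the three-stage procedure; prompt wording is illustrative.
    """
    transcript = f"Claim: {claim}\n"

    # Stage 1: setup -- assign the examiner its role and collect initial questions.
    questions = examiner(
        "Your goal is to verify whether the following claim is true or false by "
        f"asking questions about it.\nClaim: {claim}\nPlease list your questions."
    )
    answers = examinee(
        f"Your claim: {claim}\nPlease answer the following questions regarding "
        f"your claim.\n{questions}"
    )
    transcript += f"Q: {questions}\nA: {answers}\n"

    # Stage 2: follow-up questions, repeated until the examiner is satisfied
    # or the turn threshold is reached.
    for _ in range(MAX_FOLLOWUP_TURNS):
        has_more = examiner(
            f"{transcript}\nDo you have any follow-up questions? "
            "Please answer with Yes or No."
        )
        if not has_more.strip().lower().startswith("yes"):
            break
        questions = examiner(f"{transcript}\nWhat are your follow-up questions?")
        answers = examinee(
            f"Your claim: {claim}\nPlease answer the following questions regarding "
            f"your claim.\n{questions}"
        )
        transcript += f"Q: {questions}\nA: {answers}\n"

    # Stage 3: final factuality decision.
    verdict = examiner(
        f"{transcript}\nBased on the interaction, is the claim correct or "
        "incorrect? Please answer with 'correct' or 'incorrect'."
    )
    # Anything other than an explicit "correct" is treated as a rejection.
    return verdict.strip().lower().startswith("correct")
```

In practice, each callable would wrap an API call to the corresponding LM (e.g., gpt-3.5-turbo or text-davinci-003), with the outputs of previous rounds included in the prompt as described above.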

Related Work
Attribution and Fact Checking Our goal is closely related to work that checks whether LM-generated text is faithful to a given source text (Bohnet et al., 2022; Honovich et al., 2022). This problem has been addressed via several approaches, including question generation (Wang et al., 2020; Honovich et al., 2021; Scialom et al., 2021), NLI (Thorne et al., 2018; Welleck et al., 2019; Maynez et al., 2020; Dziri et al., 2022; Gao et al., 2022; Kamoi et al., 2023), data augmentation (Atanasova et al., 2022; Wright et al., 2022; Gekhman et al., 2023), and planning schemes that allow the model to self-edit its own generation (Schick et al., 2022). Unlike these works, we do not assume any reference text or external knowledge base. Instead, we directly check whether the LM's claim is likely to be correct, by probing the model for inconsistencies.
Our approach also uses multi-turn dialogue as a key component.
Model Calibration A key challenge with prediction models is to provide the probability of an answer being incorrect, a problem known as model calibration (Guo et al., 2017). The problem of factual-error detection can be viewed as a variant of calibration, where instead of a continuous probability, we provide a binary prediction of whether the model is correct or not. This is also related to the setting of selective prediction, where models can abstain from answering a query (Varshney et al., 2022; Kamath et al., 2020). Common approaches to calibration are to perform various transformations on model logits (Desai and Durrett, 2020; Jiang et al., 2021) and to measure uncertainty (e.g., see Kuhn et al., 2023). More recent work has studied the use of LMs for providing calibration, by training them on statements known to be factually correct or incorrect. This "supervised" approach has been explored via fine-tuning (Kadavath et al., 2022; Lin et al., 2022) and in-context learning (Cohen et al., 2023; Alivanistos et al., 2022). We focus on zero-shot factual-error detection with two categories: predicting whether a model's claim is correct or incorrect. While we focus on this binary setting, one could envision extensions of our approach to continuous outputs (i.e., the probability that the claim is correct).
Multi-Agent LMs Using multiple LMs in an interactive manner is a relatively new idea with many potential applications. It has been shown that LMs can utilize additional LMs or tools to enhance safety or to better solve downstream tasks (Amodei et al., 2016; Irving et al., 2018; Barnes and Christiano, 2020; Schick et al., 2023). Additionally, Park et al. (2022) showed that in a social setting, LMs demonstrate certain social skills that emerge from interaction, and Shinn et al. (2023) propose that an LM can use a different model to instruct it when to "reflect" on its recent actions while performing a planned sequence of actions aimed at solving a given query. Intuitively, this model detects signs of hallucination or inefficient planning within the LM's trajectory.
Consistency Across Generations LMs have been shown to generate inconsistent outputs given different prompt paraphrases (Elazar et al., 2021; Newman et al., 2021). Prior work showed that prompts can be automatically optimized to produce factually correct claims more robustly (Lester et al., 2021; Zhong et al., 2021; Qin and Eisner, 2021). Hao et al. (2022) utilized multiple generated paraphrases to gauge consistency, and other works (Elazar et al., 2021; Zhou et al., 2022) further proposed training objectives to improve model consistency. Another approach to handling multiple outputs is via variants of decoding strategies (Wang et al., 2022) or model ensembles (Sun et al., 2022). In our work, we build on these ideas, assuming inconsistencies are more likely to occur with incorrect claims, and let an examiner model search for them by introducing questions to the examinee.

Chain of Thought Reasoning Recent work has shown that LMs can be prompted to elaborate on their reasoning process, and that this can be exploited to improve mathematical, multi-hop, and common-sense reasoning skills (Wei et al., 2022; Press et al., 2022; Yoran et al., 2023), along with planning and problem-solving abilities (Huang et al., 2022; Long, 2023). Another interesting approach to complex reasoning in LMs is recent work on Maieutic prompting (Jung et al., 2022), which answers a question by recursively generating a set of facts and reasoning over them.
Our approach may be viewed as constructing an elaborate chain-of-thought explanation for the examinee's claim. However, we do not elicit this explanation via in-context learning or fine-tuning, but rather rely on different prompts for its generation.

Experiments
In this section, we conduct experiments on multiple datasets and models to evaluate our approach, focusing on the task of factual question-answering.

Experimental Setup
Factual Question Answering One key use-case of LMs is answering questions that seek factual knowledge, for example, "How old was Barack Obama when he was first elected?". In such cases, it is crucial for the model to answer the question correctly, or to indicate that it does not know the answer. We thus evaluate our approach on several Question Answering and Fact Completion datasets. These are typically provided as a set of (Q, A) pairs of a question Q and its ground-truth answer A.
Having gold answers allows us to check whether a predicted answer is factually correct, which in turn allows us to evaluate our LMVLM approach.
To apply cross-examination in this setting, we first convert the answer predicted by the model into an EXAMINEE claim that can be provided as input to the examination procedure. Formally, given a question Q, if Q is phrased as a fill-in-the-blank query (e.g., "Bailey Peninsula is located in ____"), then we feed it to the EXAMINEE model to obtain a prediction that completes the sentence and forms a claim. In cases where Q is phrased as a question (e.g., "Where is Bailey Peninsula located?"), we prompt the model to provide an answer in claim format with: "Please answer the following question: <Q> Please phrase your answer as a claim.". This process results in a claim C that states the model's "belief" about the answer to Q. We then evaluate the truthfulness of C through cross-examination, and compare the examiner's decision of whether C is correct to the ground-truth correctness.
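The conversion step can be sketched as follows. The blank marker "____" and the examinee callable are illustrative assumptions; the question-format prompt follows the wording quoted above.

```python
from typing import Callable


def question_to_claim(examinee: Callable[[str], str], q: str) -> str:
    """Turn a QA query into a declarative claim stated by the examinee LM (sketch)."""
    if "____" in q:
        # Fill-in-the-blank query: let the model complete the sentence,
        # then splice the completion back in to form a full claim.
        completion = examinee(q).strip()
        return q.replace("____", completion)
    # Question-format query: ask the model to phrase its answer as a claim.
    return examinee(
        f"Please answer the following question: {q} "
        "Please phrase your answer as a claim."
    ).strip()
```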
Factuality Evaluation Labels To evaluate our method, it is necessary to have "gold decisions" to compare the examiner's decisions against. Such labels can be obtained from the ground-truth answers in the data: the decision for a claim C is correct if it matches an evaluation of C against the gold answer A. To evaluate whether the claim C obtained for a question Q is correct with respect to the ground-truth answer A, we first check if A or any of its aliases (if provided as part of the dataset, e.g., "FC Tottenham" and "Tottenham Hotspur") appears as a substring in C (Schick et al., 2023; Meng et al., 2022). Next, to avoid incorrect labels resulting from this automatic evaluation (Bulian et al., 2022), we manually review all the claims marked as incorrect in the first step, and fix any labeling mistakes. We also filter out any ambiguous or unclear claims generated by EXAMINEE.
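A sketch of the first, automatic labeling pass described above; the manual review is then applied to every claim this function marks as incorrect.

```python
def auto_label(claim: str, gold_answer: str, aliases: list[str]) -> bool:
    """First-pass automatic label: is the claim correct w.r.t. the gold answer?

    A claim is marked correct if the gold answer or any of its aliases appears
    as a (case-insensitive) substring of the claim.
    """
    text = claim.lower()
    return any(ans.lower() in text for ans in [gold_answer, *aliases])
```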
Examiner Evaluation We evaluate how well the examiner detects claims that are factually incorrect, using the following metrics:
• Precision: the portion of incorrect claims, out of the claims rejected by the examiner.
• Recall: the portion of incorrect claims rejected by the examiner, out of all the incorrect claims.
• F1: the harmonic mean of precision and recall.
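These metrics can be computed from the examiner's decisions and the gold labels as in the following sketch (names and signatures are illustrative).

```python
def examiner_metrics(rejected: list[bool], is_incorrect: list[bool]) -> dict:
    """Precision, recall, and F1 of the examiner with respect to incorrect claims.

    rejected[i]     -- the examiner rejected claim i
    is_incorrect[i] -- claim i is factually incorrect (gold label)
    """
    true_rejections = sum(r and w for r, w in zip(rejected, is_incorrect))
    precision = true_rejections / max(sum(rejected), 1)
    recall = true_rejections / max(sum(is_incorrect), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}
```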
For completeness, we additionally report (in §B) the complementary Precision, Recall, and F1 scores with respect to the detection of correct claims.
Data We consider the following datasets: LAMA (Petroni et al., 2019), TriviaQA (Joshi et al., 2017), Natural Questions (NQ) (Kwiatkowski et al., 2019), and PopQA (Mallen et al., 2022). These cover a wide range of queries, from real user queries (NQ), to trivia questions (TriviaQA), and subject-relation-object facts phrased as queries (LAMA, PopQA). We consider the closed-book, open-ended setting, where we do not provide any context or answer choices to the model. We evaluate our approach on 1,000 random examples from the test set (or from the development set if a test set is not available).

In addition, we created a dataset of false claims to further test our approach. This "Falsehoods" dataset contains only wrong claims, created separately for each model (GPT-3 and CHATGPT) and for each of the four QA datasets. Concretely, given a model and a question Q, we prompt the model to generate a false answer (see §C for details). We verify that these are indeed incorrect claims by checking that the gold answer (and any of its aliases, if they exist) does not occur in the generated text. This yields a subset of examples that are realistic, namely, the answer matches the target type (e.g., "a city") but is incorrect (see examples in Table 2). The examiner's decision for these examples should always be to reject.

Models We use CHATGPT (gpt-3.5-turbo), GPT-3 (text-davinci-003) (Brown et al., 2020; Ouyang et al., 2022), and LLAMA-7B (Touvron et al., 2023), in three EXAMINER vs. EXAMINEE cross-examination settings: GPT-3 vs. GPT-3, CHATGPT vs. CHATGPT, and CHATGPT vs. LLAMA. Notably, using the same LM as EXAMINER and EXAMINEE (except for their prompts, which are different) provides a cleaner setting where both LMs share the same knowledge. The prompts used for each LM at every stage of the examination are shown in Table 10.
Baselines For each setting, we compare LMVLM with recent methods for uncertainty detection and with variants of our approach:
• Confidence-Based: The prediction head of an LM outputs a probability for each predicted token. It is common practice to use this probability as a measure of confidence in the prediction (Yoshikawa and Okazaki, 2023). In our case, the LM generates a multi-token claim, and we calculate the confidence for the claim as the product of the probabilities of the predicted tokens of the answer only. To output a binary decision (i.e., is the claim correct or not), we optimize a threshold over the train set to maximize F1. Note that our examination approach does not require tuning any threshold.
• Are you sure? (AYS): Recent work (Kadavath et al., 2022; Cohen et al., 2023) has shown that LMs can be trained to estimate their certainty in generated facts. Here, we use a zero-shot version of this approach where we directly "ask" the model whether it is sure. Specifically, we add the following prompt right after the claim generation: "Are you sure regarding the correctness of your claim? Please answer with Yes or No". We then take the output as the prediction of whether the claim is correct or not.
• I don't know (IDK) and In-context IDK (IC-IDK): Following Ganguli et al. (2023), the model is allowed to answer "Don't know" instead of producing a claim; in the in-context variant, this behavior is demonstrated on a held-out set of examples from the dataset. Intuitively, these examples' answers are likely to be unknown to the model, hence we label them with "Don't know". The model predictions are either a target text or "Don't know", and based on the output we derive a factuality label. Notably, the in-context variant requires labeled data for the demonstrations, which is not necessary for our approach.
• LMVLM: A single execution of our method, where we accept or reject the claim according to the examiner's final decision.
• LMVLM (Majority): For a given claim, we apply our method three times (with the same EXAMINER and EXAMINEE), using sampling for the generation of follow-up questions. We reject the claim if at least two of the three examinations concluded it is false.
Since output probabilities are not provided as part of CHATGPT's API, we cannot provide results for the Confidence-Based baseline in this case. Moreover, we observe that CHATGPT often fails to understand the task of IC-IDK.
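As an illustration of the Confidence-Based baseline and the Majority variant, the sketch below computes a claim's confidence from per-token log-probabilities of the answer span (assumed to be available, as with text-davinci-003), tunes the rejection threshold on a train set, and applies the two-of-three majority rule. The helper names and signatures are illustrative.

```python
import math


def claim_confidence(answer_token_logprobs: list[float]) -> float:
    """Confidence of a multi-token claim: product of the answer tokens' probabilities."""
    return math.exp(sum(answer_token_logprobs))


def tune_rejection_threshold(confidences: list[float], is_incorrect: list[bool]) -> float:
    """Pick the confidence threshold (on a train set) that maximizes F1 for
    detecting incorrect claims; a claim is rejected when its confidence is
    at or below the threshold."""
    best_threshold, best_f1 = 0.0, -1.0
    for t in sorted(set(confidences)):
        rejected = [c <= t for c in confidences]
        tp = sum(r and w for r, w in zip(rejected, is_incorrect))
        precision = tp / max(sum(rejected), 1)
        recall = tp / max(sum(is_incorrect), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold


def majority_reject(single_run_rejections: list[bool]) -> bool:
    """LMVLM (Majority): reject the claim if at least two of three runs rejected it."""
    return sum(single_run_rejections) >= 2
```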

Results
Tables 3, 4, and 5 show the results for the following pairs of EXAMINEE vs. EXAMINER: CHATGPT vs. CHATGPT, GPT-3 vs. GPT-3, and LLAMA vs. CHATGPT, respectively. We do not include results for LLAMA as an EXAMINER, since it did not work well, possibly because it is less specialized for instruction following.
Across all settings, our method outperforms the baselines, often by a large gap. For example, it obtains 85.4 F1 compared to ≤ 65.2 by baselines for CHATGPT on PopQA (Table 3), and 77.2 F1 compared to ≤ 60.1 for GPT-3 on TriviaQA (Table 4). Notably, the most substantial gains are in terms of recall, showing the superiority of our method in detecting factually incorrect claims (when compared to the baselines, which also achieve reasonable precision). Interestingly, we observe that CHATGPT generally outperforms GPT-3.
Last, Table 6 shows the accuracy of our method and baselines on our Falsehood dataset.For both CHATGPT and GPT-3, LMVLM successfully rejects a large majority of the false claims, obtaining 87%-98% accuracy with CHATGPT and 75%-99% with GPT-3 across all datasets.

Ablations
Follow-Up Removal We remove the follow-up iterations from the examination process to gauge their benefit. Results are reported for GPT-3 in Table 4 (last row), showing a large decrease in performance (e.g., 78 → 68.3 in F1 for NQ and 77.2 → 71.1 for TriviaQA). Notably, recall scores decrease by 6%-10%. Overall, this shows the importance of the follow-up questions issued by the examiner for assessing the examinee's claim.
Retrieval-Augmented LMs Language models can be significantly strengthened when augmented with additional retrieved data (Khandelwal et al., 2019; Borgeaud et al., 2022; Zhong et al., 2022; Guu et al., 2020). We next evaluate LMs in this setting as well. We focus on augmenting the EXAMINEE, since this is the model that can presumably benefit from additional information when answering questions. Specifically, we used DPR (Karpukhin et al., 2020) for retrieval, took the top three passages it retrieved from Wikipedia, concatenated them, and instructed a GPT-3 EXAMINEE (in the prompt) to answer the question based on the passages.
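A sketch of this augmentation step, where retrieve is an assumed helper wrapping a DPR index over Wikipedia; only the examinee's prompt changes, and the examination itself is unchanged.

```python
from typing import Callable


def retrieval_augmented_claim(
    examinee: Callable[[str], str],
    retrieve: Callable[[str, int], list[str]],
    question: str,
    k: int = 3,
) -> str:
    """Answer a question with retrieval augmentation (sketch).

    `retrieve` is assumed to return the top-k Wikipedia passages for the
    question (e.g., from a DPR index); the passages are concatenated into
    the examinee prompt.
    """
    passages = retrieve(question, k)
    context = "\n\n".join(passages)
    return examinee(
        "Please answer the following question based on the passages below, "
        "and phrase your answer as a claim.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}"
    )
```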
Augmenting the EXAMINEE with DPR improved its accuracy (i.e., the fraction of questions answered correctly) from 50.1 to 66.4. Table 9 reports results on factuality detection when using GPT-3 as EXAMINER. It can be seen that our approach still outperforms the baselines, as in the case without augmentation.
Comparing the performance of the retrieval-augmented EXAMINEE to the non-augmented one (see Table 4, NQ columns), we see that augmenting the EXAMINEE with retrieval leads to a substantial increase in precision (79.3 → 87.5) with only a small decrease in recall (76.8 → 75.7), and an overall improvement of 2.9% in F1. This shows that, alongside the improvement of the EXAMINEE, LMVLM still performs well in detecting factual errors when they occur.

Analysis of Cross-Examinations
We analyze cross-examinations by GPT-3 and CHATGPT to better understand the success and failure cases of our method. We find that examiner LMs typically ask multiple questions in the examination and, perhaps surprisingly, seem to apply different strategies to reveal inconsistencies. We note that further analysis is required to better understand whether the EXAMINER indeed utilizes the conducted examination in its factuality decisions.
Examination Statistics Table 7 provides statistics on the cross-examinations performed by CHATGPT and GPT-3. Both models introduce multiple queries (6-7 on average) during an examination, with typically 1-2 steps of follow-up questions, which are important for the examiner's decision (§4.3). We also observe a non-negligible number of claims (9%-15%) for which the examiner LM does not arrive at a concrete final decision (i.e., it does not generate "correct" or "incorrect" as the final decision; we reject the claim in those cases). In our qualitative analysis, we identify reasons that could explain these cases. We note that in most cases where LMVLM fails, EXAMINEE provides incorrect information to EXAMINER. This may indicate that in those cases EXAMINEE encodes a large set of factually wrong facts that are mutually consistent, making it hard for EXAMINER to detect inconsistencies. Finally, the fact that CHATGPT more commonly validates the claim via logical questions might be a key factor in its superiority over GPT-3 in our setting.

Conclusion
We introduce LMVLM, a method for zero-shot detection of factuality errors. Our method uses prompting to facilitate multi-turn interactions between two LMs, to reveal inconsistencies that imply factually incorrect claims. LMVLM builds on a fundamental connection between self-consistency (i.e., consistency of an LM with itself) and factual consistency (i.e., consistency between claims generated by an LM and ground-truth facts). We consider the LM as the source of information, and we test whether a claim it has generated is consistent with other beliefs it holds.
Our work can be extended in several ways. First, LMVLM provides interpretable information about related beliefs of the model, which could be analyzed to understand what makes the model commit certain mistakes. Second, one may incorporate several LM instances into the factuality detection process, rather than having only a single EXAMINER. Finally, one can train the EXAMINER to generate questions more effectively.

Limitations
We note three limitations of our LMVLM method. First, it requires multiple queries of the examinee and examiner LMs, which could be costly when using external APIs such as those used in this work. This could be a key consideration when scaling this approach to large numbers of claims.
Second, for our method to succeed, both LMs (EXAMINEE and EXAMINER), but mostly EXAMINER, should be able to follow instructions and to reason over information in a relatively long context. This skill is currently mostly demonstrated by larger models (>10B parameters), and thus our method may not perform as well with smaller models.
Last, any logical flaws in the examiner's operation are likely to affect the overall examination, potentially leading to inaccurate decisions. However, our experiments show that, even if such flaws occur, our method is still useful on average, as it substantially improves factuality detection. Nonetheless, developing safety mechanisms that detect and mitigate logical flaws is an important research direction that we leave for future work.

A Prompts
Table 10 provides the prompts used in our LMVLM approach.

B Additional Evaluation
We follow the same experimental setting as in §4, but evaluate performance with respect to acceptance of claims rather than rejection. Specifically, here we use the following definitions:
• Precision: the portion of correct claims, out of the claims accepted by the examiner.
• Recall: the portion of correct claims accepted by the examiner, out of all the correct claims.
In addition, we introduce an ensemble AYS + LMVLM: for a given claim, we first run the AYS method, and if the claim is rejected by this method we then apply LMVLM (Majority) to obtain a final decision.
Tables 11 and 12 show the evaluation results for the settings of CHATGPT vs. CHATGPT and GPT-3 vs. GPT-3, respectively.
In terms of precision, our method outperforms the other baselines, often by a large gap (e.g., 81.6 compared to ≤ 60 by baselines for CHATGPT on PopQA, and 68.7 compared to ≤ 52.4 for GPT-3 on PopQA). Moreover, it does so while maintaining good recall relative to the baselines, except for AYS, which has the best recall scores.

C Falsehoods Data
To generate a wrong claim, given a query Q from one of the QA datasets we use, we prompt our models in the following way: if Q is in question format, we use "Please answer the following question with a wrong answer: <Q>" and further request the LM to "Please also phrase your answer as an argument". If Q is in sentence-completion format, we use "Please complete the following sentence with a wrong answer: <Q>" and concatenate Q with the model's answer. Table 13 presents a few examples of these, generated by GPT-3. We manually verified that all the generated claims were indeed factually wrong.
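The prompt construction and the verification filter can be sketched as follows; the helper names are illustrative.

```python
def falsehood_prompt(q: str) -> str:
    """Build the prompt used to elicit an intentionally wrong claim (sketch)."""
    if q.strip().endswith("?"):
        # Question-format query.
        return (
            f"Please answer the following question with a wrong answer: {q} "
            "Please also phrase your answer as an argument."
        )
    # Sentence-completion query; the model's completion is later concatenated to Q.
    return f"Please complete the following sentence with a wrong answer: {q}"


def is_verified_false(claim: str, gold_answer: str, aliases: list[str]) -> bool:
    """Keep only claims in which neither the gold answer nor any alias appears."""
    text = claim.lower()
    return all(ans.lower() not in text for ans in [gold_answer, *aliases])
```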

Figure 2: The three-stage process of cross-examination between the EXAMINER and EXAMINEE, where the factuality of a claim C generated by EXAMINEE is estimated by EXAMINER.

Table 1: Portion of factually correct claims by every EXAMINEE LM on each dataset.

Table 2: Example false claims generated by CHATGPT for PopQA and by GPT-3 for TriviaQA.

Table 3: Precision (P), Recall (R), and F1 scores for LMVLM with CHATGPT as EXAMINER and EXAMINEE, compared to baselines. The last row shows an ablation of our method without the follow-up questions stage.

Table 4: Precision (P), Recall (R), and F1 scores for LMVLM with GPT-3 as EXAMINER and EXAMINEE, compared to baselines. The last row shows an ablation of our method without the follow-up questions stage.
"Are you sure regarding the correctness of your claim?Please answer with Yes or No".Then we take the output as the prediction whether the claim is correct or not.• I don't know (IDK): Recently, Ganguli et al. ing it on an held-out set of examples from the dataset in a zero-shot setting.Intuitively, these examples' answers are likely to be unknown to the model, hence we labeled them with "Don't know".The model predictions are either a target text or "Don't know".Based on the output, we generate a factuality label as in the IDK baseline above.

Table 5: Precision (P), Recall (R), and F1 scores for LMVLM with CHATGPT as EXAMINER and LLAMA as EXAMINEE, compared to baselines. The last row is an ablation of our method without the follow-up questions stage.

Table 6: Accuracy of GPT-3 and CHATGPT as EXAMINER on false claims generated for each dataset.
"The first Fast and Furious film was released in 2001."In which year was the first Fast and Furious film released?"The screenwriter who is credited with writing the screenplay for Winner is Wendy Riss" 1.What is the name of the screenwriter who is credited with writing the screenplay for Winner? 2. Who is credited with writing the screenplay for Winner?What is the birth order of the Pevensie children in C S Lewis's The Lion, the Witch and the Wardrobe?2. What are their ages?3. Who appears second in this list?How many vertices does an octahedron have?EXAMINEE: An octahedron has eight vertices, each of which is the point where three edges meet.

Table 8: Examples of frequent patterns of CHATGPT and GPT-3, observed through manual analysis of cross-examinations.
Wrong intermediate answers: The EXAMINEE responds with factually incorrect answers to one or more of the questions asked by the EXAMINER. We observe that this occurs mostly in cases where the original claim is false (it happens in only about 14% of the cases where the EXAMINEE is correct). In both models, this can be observed in about half of the cases where the claim is false and has also been detected by the EXAMINER. Furthermore, it occurs in about 80% of the cases where the EXAMINER has accepted a false claim, and in 45% of the cases where the EXAMINER has rejected a correct claim.
Algorithm 1 Cross-Examination
Input: A claim C generated by EXAMINEE
Output: Correct / Incorrect
  Report ← ""
  Q ← EXAMINER(P_setup, C)
  R_curr ← EXAMINEE(C, Q)
  Report ← Report + Q + R_curr
  while EXAMINER(P_follow-up, R_curr) is "Yes" do
    Q ← EXAMINER(P_follow-up, R_curr)
    R_prev ← R_curr
    R_curr ← EXAMINEE(R_prev, Q)
    Report ← Report + Q + R_curr
  end while
  return EXAMINER(P_decision, C, Report)

E Example Cross-Examinations

Full cross-examination examples are provided in Tables