SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables

Current scientific fact-checking benchmarks exhibit several shortcomings, such as biases arising from crowd-sourced claims and an over-reliance on text-based evidence. We present SCITAB, a challenging evaluation dataset consisting of 1.2K expert-verified scientific claims that 1) originate from authentic scientific publications and 2) require compositional reasoning for verification. The claims are paired with evidence-containing scientific tables annotated with labels. Through extensive evaluations, we demonstrate that SCITAB poses a significant challenge to state-of-the-art models, including table-based pretraining models and large language models. All models except GPT-4 perform barely above random guessing. Popular prompting techniques, such as Chain-of-Thought, do not yield substantial performance gains on SCITAB. Our analysis uncovers several unique challenges posed by SCITAB, including table grounding, claim ambiguity, and compositional reasoning. Our code and data are publicly available at https://github.com/XinyuanLu00/SciTab.


Introduction
Scientific fact-checking is a crucial process that involves validating the accuracy of scientific claims by cross-referencing them with established scientific literature, research, or data (Guo et al., 2022). This process is essential for preserving the integrity of scientific information, preventing the spread of misinformation, and fostering public trust in research findings. However, the sheer volume of scientific data and claims can be overwhelming for manual fact-checking, making automated scientific fact-checking an imperative research area of NLP.
However, existing scientific fact-checking datasets still exhibit several limitations. First, the claims are crowd-sourced rather than collected from real scientific papers. This leads to problems such as bias in human annotation, a lack of diversity, and shallow claims that do not reflect the complexity of scientific reasoning. For example, most claims in Sci-Fact can be validated by a single sentence in a paper's abstract, which oversimplifies the scientific discourse. Second, the claims in the existing benchmarks are solely validated against text-based evidence, primarily paper abstracts. However, in many scientific processes, claims are intrinsically tied to quantitative experimental data, commonly presented in tables and figures. This disparity highlights a significant gap between the existing benchmarks and real-world scientific fact-checking needs. To bridge these gaps, a dataset is needed that 1) compiles real-world claims from scientific papers, and 2) includes original scientific data such as tables and figures.
In this paper, we propose a novel dataset, SCITAB, which fulfills these stated criteria. It contains 1,225 challenging scientific claims, each demanding compositional reasoning for verification using scientific tables. Our data is derived from the SciGen dataset (Moosavi et al., 2021), a resource that includes scientific tables and claims crawled from arXiv.org. We first manually select the check-worthy scientific claims from the raw data. Following this, we employ a strategy of human-model collaboration, as depicted in Figure 2, to generate claims that are either contradicted or unverifiable based on the table's content. Figure 1 shows a claim from SCITAB and the corresponding reasoning process to verify it. Compared with existing benchmarks, SCITAB is closer to real-world scientific fact-checking in terms of more realistic claims and table-based evidence. Through data analysis, we further show that the claims in SCITAB necessitate a more comprehensive and nuanced set of reasoning skills for verification, e.g., numerical reasoning and commonsense knowledge.

[Figure 1, left. Paper: When Choosing Plausible Alternatives, Clever Hans can be Clever. Paper ID: 1911.00225v1. Supported claim: "A's productivity of 57.5% expresses that it appears in 7.5% more often than expected by random chance." Refuted claim: "A's productivity of 57.5% expresses that it appears in 9.5% more often than expected by random chance." Not Enough Info claim: "The low performance of "to" can be explained by the fact that it is responsible for only 4.6% of the inference in the training set."]
[Figure 1, right. Reasoning graph for the supported claim: productivity corresponds to the "Prod." column (closed-domain knowledge: table caption); A's productivity is 57.5%; the number of random chance is 50% (commonsense knowledge); the subtraction result between 57.5% and 50% is 7.5% (subtraction); fact checker: Supported.]
Figure 1: An example of our SCITAB dataset (left) and its corresponding reasoning graph (right). Each data entry contains paper name, paper id, table, one claim, and its corresponding label (Supported, Refuted, Not Enough Info).
We employ SCITAB as a diagnostic dataset for benchmarking the zero-shot and in-context learning performance of a wide range of state-of-the-art models, including table-based pretraining models, encoder-decoder models, open source language models, and API-based language models. We observe that all models, with the exception of GPT-4, achieve F1 scores only marginally above random guessing, which underscores the challenging nature of SCITAB. Additionally, established prompting methods like Chain-of-Thought (Wei et al., 2022) and Program-of-Thoughts (Chen et al., 2022), which typically enhance performance across most reasoning tasks, do not bring performance gains on SCITAB. Our error analysis sheds light on several unique challenges in SCITAB that may lead to this, such as table grounding, dealing with ambiguous claims, and compositional reasoning. We make our dataset fully accessible to the research community.

The SCITAB Dataset
We adopt a human-model collaboration strategy to construct SCITAB, as shown in Figure 2. We describe the steps involved in data preparation (Section 2.1), automatic claim generation (Section 2.2), and manual claim verification (Section 2.3).

Data Preparation
We use the publicly available SciGen (Moosavi et al., 2021) dataset as our primary data source. The dataset was created by crawling computer science papers from arXiv. The tables and the texts explaining the tables are extracted from the papers to create (table, description) pairs for the task of data-to-text generation. From all the table descriptions of SciGen, we first select the check-worthy scientific claims following the criteria established by Lee et al. (2009) for academic writing. We focus on the descriptions that serve the purpose of "highlighting and commenting on key data", i.e., describing research findings based on the data presented in scientific tables. Given the task's objective nature and to save the cost of human labor, we hire a graduate student majoring in computer science to manually select scientific claims based on the aforementioned criteria using the user interface in Appendix A.2. This decision was based on a pilot annotation which showed that a well-trained annotator can achieve over 95% accuracy in filtering scientific claims. To safeguard the quality, we include an option to mark the claim as "Discard-It's not a claim, or it's an incomplete, or not grammatically correct sentence." during the subsequent claim verification process. Using this approach, we obtained 872 real-world scientific claims from 1,301 table descriptions in the SciGen dataset.

Automatic Claim Generation
False Claims. A fact-checking dataset requires both true and false claims. However, acquiring false claims that naturally occur within well-

Unverifiable Claims. To construct a more challenging dataset, we also integrate claims that are unverifiable with the table information (labeled as Not Enough Info, NEI). We leverage InstructGPT to generate candidate NEI claims by prompting the model with the original table and the instruction: Please generate 5 relevant scientific claims based on the information in the table. This process yields a diverse set of free-formed claims that enrich the diversity of SCITAB. However, as LLMs tend to generate content that might not always be grounded in the provided data, many of the generated claims turn out to be relevant but unverifiable with respect to the table. We adopt manual verification (elaborated in Section 2.3) to select them as NEI claims.

Manual Claim Verification
We subsequently employ a human verification process for two purposes: first, to verify the quality of the 872 false claims and 900 NEI claims that were generated by InstructGPT; second, to critically review the 872 real-world scientific claims obtained in Section 2.1.This task involves selecting claims that can be verified exclusively based on the information presented in the table, without the need for additional context from the associated paper.
For each pair of the true claim c and its corresponding generated counter-claim c′, we ask the annotator to choose one of the following three options: (A) c is not exclusively supported by the table, (B) c is exclusively supported by the table, but c′ is not refuted by the table, and (C) c is exclusively supported by the table, and c′ is refuted by the table. For each candidate NEI claim, we ask the annotator to judge whether it is unverifiable with respect to the table.
Annotator Recruitment. Given that our data source is from computer science papers, we recruit university students majoring in computer science with basic math and programming backgrounds for annotation. We ask each annotator to fill in a questionnaire, including their age, department, maximum workload per week, etc. After that, we provide a training session to ensure they understand the task and can use the annotation interfaces (Appendix B.2 and B.3). We also give them three samples to test their understanding. We recruit twelve annotators who passed the training session.

Table 1: Comparison of SCITAB with three existing table fact-checking datasets: TabFact (Chen et al., 2020), FEVEROUS (Aly et al., 2021), and SEM-TAB-FACTS (Wang et al., 2021). The table presents statistics related to the domain, annotator (AMT represents Amazon Mechanical Turk), maximum reasoning hops, veracity labels percentage of each dataset, the total number of claims, and average claims per table.
In compliance with ethical guidelines, we ensure fair compensation for the annotators. Each claim annotation is reimbursed at a rate of 0.37 USD, resulting in an hourly wage of 11.2 USD.
Quality Control and Annotator Agreement. To ensure the quality of the annotation, we apply strict quality control procedures following the guidelines outlined in the Dataset Statement (Bender and Friedman, 2018). We assign two different annotators to perform a two-round annotation for each claim, while two authors review and resolve any identified errors or issues. To measure the inter-annotator agreement, we use Cohen's Kappa (Cohen, 1960). Our inter-annotator agreement is 0.630 for the false claim verification task (872 claims in total) and 0.719 for the NEI claim verification task (900 claims in total). Both values indicate substantial agreement among the annotators.
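The agreement statistic above can be illustrated with a minimal sketch of Cohen's Kappa for two annotators. The function name `cohens_kappa` and the toy labels are assumptions made for illustration; they are not the annotation data or the script used to obtain the reported values.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels (hypothetical, not SCITAB annotations).
ann_1 = ["keep", "keep", "discard", "keep", "discard", "keep"]
ann_2 = ["keep", "discard", "discard", "keep", "discard", "keep"]
kappa = cohens_kappa(ann_1, ann_2)
```

On the toy labels, the observed agreement is 5/6 and the chance agreement is 0.5, so kappa is about 0.67. Under the commonly used interpretation bands, values between 0.61 and 0.80 are read as substantial agreement, which is the range the two reported values fall into.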

Data Analysis
Table 1 shows the statistics of our SCITAB dataset and the comparison with three existing table fact-checking datasets: TabFact (Chen et al., 2020), FEVEROUS (Aly et al., 2021), and SEM-TAB-FACTS (Wang et al., 2021). Compared with these datasets, SCITAB 1) is annotated by domain experts rather than crowd-sourced workers, 2) contains more challenging claims that require up to 11 reasoning steps for verification, and 3) has a more balanced distribution of veracity labels and a higher percentage of NEI claims. We conduct a more in-depth analysis of SCITAB as follows.

Reasoning Analysis
Reasoning Types. To study the nature of reasoning involved in fact-checking claims in SCITAB, we adapt the set of table-based reasoning categories from INFOTABS (Gupta et al., 2020) to define 14 atomic reasoning types, as shown in Table 2. Among them, "closed-domain knowledge" and "open-domain knowledge" are specially designed for SCITAB. Closed-domain knowledge refers to obtaining background information from the table caption or title, e.g., knowing that "Prod." refers to "Productivity" from the table caption in Figure 1. Open-domain knowledge refers to commonsense knowledge not presented in the table, e.g., the relationship between precision and recall.
Given the designed reasoning types, we manually analyze 100 samples in SCITAB by annotating the graph of reasoning steps for verifying each claim. We identify 476 atomic reasoning steps from the 100 analyzed samples and show the proportion for each reasoning type in Table 2. We observe that SCITAB has a multifaceted and complex range of reasoning types and a high proportion of claims requiring different types of domain knowledge.
Reasoning Depth. We further measure the reasoning depth (the number of required reasoning steps) for each claim and show the reasoning depth distribution in Figure 3. We find that the analyzed claims have an average depth of 4.76 and a maximum depth of 11. Moreover, 86% of the claims require 3 or more reasoning steps, which demonstrates the complexity of reasoning in SCITAB.
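The depth statistics above can be reproduced from per-claim step counts with a few lines. The sketch below uses a hypothetical helper `depth_summary` and toy depths, since the per-claim annotations are not listed in the text; only the totals 476 and 100 come from the paper.

```python
from statistics import mean

def depth_summary(depths, threshold=3):
    """Return (average depth, maximum depth, share of claims with depth >= threshold)."""
    share = sum(d >= threshold for d in depths) / len(depths)
    return mean(depths), max(depths), share

# Hypothetical per-claim step counts for five claims (not real SCITAB annotations).
toy_depths = [2, 3, 5, 4, 11]
avg, deepest, share_3_plus = depth_summary(toy_depths)

# Consistency check on the reported figures: 476 steps over 100 analyzed claims.
reported_avg = 476 / 100
```

The last line confirms that the reported average depth of 4.76 follows directly from 476 atomic steps spread over 100 analyzed claims.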
Reasoning Graph. We showcase the reasoning graph for the example in Figure 1 on the right side of the figure. Verifying this claim requires various types of reasoning including: 1) background knowledge from the table caption: "productivity" corresponds to the "Prod." column in the table; 2) commonsense knowledge: "random chance" means 50% accuracy; 3) simple lookup: "A's productivity" refers to the cell located at the last row and the "Prod." column; and 4) numerical reasoning: the difference between 57.5% and 50% is 7.5%. This case study provides further insights into the complexity and variety of reasoning involved in SCITAB, revealing the difficulty of the dataset.

Table 2: The function names, descriptions, and their proportions in our SCITAB dataset.
Open-domain knowledge: Extract additional information required by domain experts. (proportion: 5.3%)
Commonsense knowledge: Extract commonsense knowledge necessary for claim verification.
Subtract: Perform subtraction of two numbers.
Divide: Perform division of two numbers.
Rank: Determine the rank of a set of numbers.
Different / Same: Determine if two numbers are different or the same.
Add: Calculate the sum of two numbers.
Max / Min: Retrieve the maximum or minimum number from a set of numbers.
Col / Rowname: Retrieve the column or row name from the table.
Trend same/different: Determine the trend for two columns or rows, whether they are the same or different.
Set check: Verify if a value belongs to a set of numbers. (proportion: 2.9%)
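The four reasoning steps of the Figure 1 example can be written out as a small program. This is a sketch under stated assumptions: the table excerpt, the alias dictionary, and the function `verify_margin_claim` are hypothetical; only the values 57.5, 50, 7.5, and 9.5 and the "Prod." column name come from the example.

```python
# Step 1 (closed-domain knowledge): the caption maps "productivity" to the "Prod." column.
caption_aliases = {"productivity": "Prod."}

# Hypothetical one-row excerpt of the table; only the 57.5 value comes from the paper.
table = {"A": {"Prod.": 57.5}}

# Step 2 (commonsense knowledge): random chance corresponds to 50%.
RANDOM_CHANCE = 50.0

def verify_margin_claim(row, concept, claimed_margin, tol=1e-6):
    """Check a claim of the form '<row> exceeds random chance by <claimed_margin>'."""
    column = caption_aliases[concept]      # closed-domain knowledge
    value = table[row][column]             # step 3: simple lookup
    margin = value - RANDOM_CHANCE         # step 4: subtraction
    return "Supported" if abs(margin - claimed_margin) <= tol else "Refuted"

supported = verify_margin_claim("A", "productivity", 7.5)
refuted = verify_margin_claim("A", "productivity", 9.5)
```

The two calls mirror the supported claim (7.5%) and the refuted claim (9.5%) from Figure 1. A Not Enough Info claim has no such program, because the table does not contain the quantity it refers to.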

Refuted and NEI Claims Analysis
One potential risk of model-generated claims is that they may lack diversity and exhibit the same pattern. For example, in the Sci-Fact (Wadden et al., 2020) dataset, where the refuted claims are generated by flipping the meaning of the original true claims, we found that out of 100 randomly sampled refuted claims, 85 simply negated the original claim by adding negation words such as "not" (more details in Appendix C). To evaluate the diversity of claims for our SCITAB dataset, we randomly select 60 refuted claims and then manually annotate their reasons for refutation. Results are shown in Table 3 (top half). We find that SCITAB exhibits a greater diversity in refuted claims compared to Sci-Fact. Besides common error types such as "incorrect calculation results" (41.7%), there are also unique types of errors that are more reflective of the complexities in real-world scientific claims. For example, 33.33% of the refuted claims contain "incorrect approximation words", and 10.0% are cases where "the claim is partially right", consistent with the fact that ambiguity and half-truths are common phenomena in scientific discourse. Additional examples of refuted claims are in Appendix E.
The NEI claims (Table 3, bottom half) also exhibit diverse reasoning patterns. The two most common features for unverifiable claims are insufficient evidence in the table and the lack of background knowledge. The lack of closed-domain knowledge is another reason for NEI, where additional information in the paper is necessary to verify the claim. Other reasons include the use of vague pronouns (e.g., "it", "this"), which brings ambiguity to the claim. These distinct refuted and NEI reasoning types highlight the unique features of SCITAB, making it a more comprehensive and realistic representation of the challenges faced in real-world scientific fact-checking.

Experiment
We formally define the task of scientific table-based fact-checking as follows. A scientific table $T$ consists of a table caption $P$ and the table content $\{T_{i,j} \mid i \le R_T, j \le C_T\}$ with $R_T$ rows and $C_T$ columns, where $T_{i,j}$ is the content in the $(i, j)$-th cell. Given a claim $C$ describing a fact to be verified against the table information in $T$, the task is to predict whether $C$ is supported, refuted, or unverifiable (Not Enough Info) given $T$.
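The task definition maps naturally onto a small data structure. The sketch below is one possible representation; the class names `SciTable` and `Example` and their field names are assumptions made for illustration and do not describe the released data format.

```python
from dataclasses import dataclass
from typing import List

LABELS = ("Supported", "Refuted", "Not Enough Info")

@dataclass
class SciTable:
    caption: str              # P in the text
    cells: List[List[str]]    # T_{i,j}: cells[i][j], header cells included

    @property
    def shape(self):
        """(R_T, C_T): number of rows and columns."""
        return len(self.cells), (len(self.cells[0]) if self.cells else 0)

@dataclass
class Example:
    table: SciTable
    claim: str    # C in the text
    label: str    # one of LABELS

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")

# Toy instance; the header name "Cue" and the caption are hypothetical.
toy_table = SciTable("Prod. denotes productivity.", [["Cue", "Prod."], ["A", "57.5"]])
toy_example = Example(toy_table, "A's productivity is 57.5%.", "Supported")
```

A fact-checking model is then any function from a `(table, claim)` pair to one of the three labels.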
Considering the real-world situation that large-scale training data is either not available or expensive to collect, we focus on the zero-shot/in-context evaluation where the model can access zero or only a few in-domain examples from SCITAB. To this end, we randomly hold out 5 tables with 25 claims as model-accessible data and use the rest of the data as the unseen test set. This also prevents the model from learning spurious features that lead to overestimated performance (Schuster et al., 2019).
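A hold-out of this kind can be sketched as a split at the table level, so that no table contributes claims to both the demonstration pool and the test set. The function `split_by_table`, the `table_id` field, and the seed are assumptions for illustration; the text does not specify how the random selection was implemented.

```python
import random

def split_by_table(examples, n_holdout_tables=5, seed=0):
    """Hold out all claims of a few randomly chosen tables; the rest form the test set."""
    table_ids = sorted({ex["table_id"] for ex in examples})
    rng = random.Random(seed)
    held_out = set(rng.sample(table_ids, n_holdout_tables))
    demos = [ex for ex in examples if ex["table_id"] in held_out]
    test_set = [ex for ex in examples if ex["table_id"] not in held_out]
    return demos, test_set

# Toy data: 8 hypothetical tables with 2 claims each.
toy = [{"table_id": f"t{i}", "claim": f"claim {i}-{j}"}
       for i in range(8) for j in range(2)]
demos, test_set = split_by_table(toy)
```

Splitting by table rather than by claim keeps every claim about a held-out table out of the test set, which matches the stated goal of evaluating on unseen data.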

Models
We conduct a comprehensive evaluation of SCITAB for various models, including table-based pretraining models, encoder-decoder models, open source LLMs, and closed source LLMs.We also study the human performance to analyze the upper bounds on SCITAB.
Table-based LLMs. These are pre-trained transformer models fine-tuned on tabular data. We choose three different models: 1) TAPAS (Herzig et al., 2020), a BERT-based model fine-tuned on millions of tables from English Wikipedia and corresponding texts, 2) TAPEX (Liu et al., 2022b), a model that fine-tunes BART (Lewis et al., 2020) on a large-scale synthetic dataset generated by synthesizing executable SQL queries and their execution outputs, and 3) TAPEX-Zero (Liu et al., 2023b), an enlarged version of TAPEX. For TAPAS and TAPEX, we use their fine-tuned version on TabFact (Chen et al., 2020) for table fact-checking.
Encoder-Decoder LLMs. We also use encoder-decoder models where both the input and output are sequences of tokens. To adapt the model to take the table as input, we flatten the table as a sequence following Chen et al. (2020). The input is then formulated as [T; P; C; Q], where T is the linearized table, and Q is a question template "Based on the information in the table, is the above claim true? A) True B) False C) Unknown?". We choose FLAN-T5 (Chung et al., 2022), an improved T5 model (Raffel et al., 2020) pre-trained on more than 1.8K tasks with instruction tuning, which has achieved strong zero-shot/in-context performance on other fact-checking benchmarks.

Closed Source LLMs. These include InstructGPT and GPT-4 (OpenAI, 2023). We evaluate the setting that directly predicts the label and the Chain-of-Thought (CoT) (Wei et al., 2022) setting, which generates explanations before predicting the final label. We also include the Program-of-Thoughts (PoT) (Chen et al., 2022) model that has shown strong ability in solving complex numerical reasoning tasks. It first parses the reasoning steps as Python programs and then executes them on a Python interpreter to derive accurate answers. Since most claims in SCITAB also require numerical reasoning, we want to test whether program-guided reasoning can be extended to table-based fact-checking.
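The input formulation [T; P; C; Q] can be sketched as simple string concatenation. The row-by-row linearization with " | " separators and the "Caption:" and "Claim:" prefixes are illustrative assumptions; the exact flattening scheme of Chen et al. (2020) is not reproduced here. Only the question template is taken from the text.

```python
QUESTION = ("Based on the information in the table, is the above claim true? "
            "A) True B) False C) Unknown?")

def linearize_table(header, rows):
    """Flatten a table row by row into one string (illustrative format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_input(header, rows, caption, claim):
    """Concatenate linearized table, caption, claim, and question: [T; P; C; Q]."""
    parts = [
        linearize_table(header, rows),  # T: linearized table
        "Caption: " + caption,          # P: table caption
        "Claim: " + claim,              # C: claim to verify
        QUESTION,                       # Q: question template
    ]
    return "\n".join(parts)

# Toy table; the header name "Cue" and the caption text are hypothetical.
prompt = build_input(
    ["Cue", "Prod."], [["A", 57.5]],
    "Prod. denotes productivity.",
    "A's productivity of 57.5% expresses that it appears in 7.5% more often "
    "than expected by random chance.",
)
```

For the in-context setting, demonstrations built the same way (with their answers appended) would be prepended to this string.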

Human Performance. To examine how humans perform on our SCITAB dataset, we hired an annotator from our candidate annotators pool, following the same training procedure as other annotators. In the case of 2-class classification, we randomly selected 40 samples: 20 each for supported and refuted claims. For 3-class classification, we randomly selected 60 samples, ensuring an even distribution of 20 samples across the three label categories (supported, refuted, and not enough information). The annotator took approximately 1.5 hours for the 2-class fact-checking task and 2 hours for the 3-class setting. We report the Macro-F1 scores at the bottom of Table 4.

Main Results
We evaluate all models under both zero-shot and in-context settings. In the zero-shot setting, the model does not have access to any in-domain data. In the in-context setting, we provide three held-out examples as demonstrations. We report two sets of results: the 2-class case, where examples labeled as NEI are excluded (since some models cannot process NEI claims), and the 3-class case including all three labels. The results are shown in Table 4. We have five major observations.
1. In general, all open source LLMs, including encoder-decoder models and decoder-only models, do not achieve very promising results on SCITAB and they still have a large gap from human performance. The best result is 63.62 for the 2-class setting (Vicuna-7B) and 38.05 for the 3-class setting (FLAN-T5-XL). Both results are only moderately better (+13.62 and +4.72) than random guessing. In contrast, a well-trained human annotator can achieve 92.46 and 84.73 F1 scores in the 2-class and 3-class settings, respectively. This reveals the challenging nature of SCITAB and its potential to be the future benchmark for scientific fact-checking.

2. Counter-intuitively, table-based LLMs do not outperform models pre-trained on pure texts, for example, FLAN-T5. This discrepancy may be attributed to the dissimilarity between the distribution of tables in scientific literature and publicly available table corpora. For example, scientific tables commonly include both row and column headers, whereas most tables in Wikipedia lack row headers. Meanwhile, the claims in our dataset are usually much longer than those in previous works, raising challenges to table-based LLMs.
3. The results in the 3-class setting are notably poorer than those in the 2-class setting. This discrepancy reveals the challenges that most models face when confronted with the NEI class. One plausible explanation could be the inherent difficulty in distinguishing between 'refuted' and 'NEI' claims, a task that even trained human annotators struggle with, as noted by Jiang et al. (2020). Our forthcoming error analysis will further demonstrate that the inclusion of the NEI class tends to diminish the models' confidence, causing a shift in their predictions from 'supported/refuted' to 'NEI'.
4. Interestingly, the provision of in-context examples does not result in improved performance for the majority of models. This observation is somewhat expected for open source LLMs as they have not been reported to possess in-context learning capabilities. Nonetheless, it is surprising to find that even with chain-of-thought prompting, in-context demonstrations do not yield positive effects for InstructGPT and GPT-4. Our error analysis on the PoT offers some insight into this phenomenon and will be discussed in the next section.
5. Closed source LLMs perform better than open source LLMs, with GPT-4 achieving 78.22 macro-F1 for the 2-class setting and 64.80 for the 3-class setting. This aligns with the assertion that GPT-4 has a strong ability to perform complex reasoning (OpenAI, 2023) and we show that this ability can generalize to tabular data as well. However, the black-box nature of OpenAI models restricts our further analysis of its behavior.
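The scores discussed in these observations are macro-F1 values. A minimal sketch of the metric follows, together with a check of the margins over random guessing quoted in the first observation. The function `macro_f1` and the toy labels are illustrative, and the chance levels of 100/2 and 100/3 are an assumption that happens to be consistent with the reported margins of +13.62 and +4.72.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy predictions (hypothetical, not model outputs from the paper).
gold = ["S", "S", "R", "R"]
pred = ["S", "R", "R", "R"]
toy_score = macro_f1(gold, pred, ["S", "R"])

# Margins over chance implied by the reported numbers,
# assuming chance levels of 100/2 and 100/3.
margin_2class = 63.62 - 100 / 2
margin_3class = 38.05 - 100 / 3
```

On the toy labels the per-class F1 scores are 2/3 and 4/5, giving a macro-F1 of about 0.73. Rounded to two decimals, the two margins come out as 13.62 and 4.72.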

Error Analysis
InstructGPT and GPT-4. We show the confusion matrices for InstructGPT and GPT-4 under the zero-shot 3-class setting in Figure 4. We find that both models have difficulty in accurately predicting the NEI class. InstructGPT displays a "less confident" pattern, frequently classifying supported and refuted claims as 'NEI'. In contrast, GPT-4 exhibits overconfidence, incorrectly categorizing NEI claims as either supported or refuted. This corroborates our earlier observation that distinguishing whether a claim is verifiable is one of the key challenges for SCITAB.
Further, we also examine individual error instances, with typical examples provided in Figures 11 and 12 of Appendix F. The majority of 'supported' claims that were incorrectly classified as 'refuted' (Case 6) involve numerical reasoning or comparison.Conversely, when 'refuted' claims are inaccurately predicted as 'supported' (Case 3), we find that LLMs often overlook claims containing negation, indicating a lack of deep comprehension.For cases where 'supported' or 'refuted' claims are erroneously predicted as 'NEI' (Cases 1 and 2), such claims typically demand extensive reasoning and a deep understanding of the research findings.Interestingly, when faced with these complex cases, the model tends to default to the safer choice of 'uncertain' (NEI).
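The two error patterns described above correspond to different regions of a 3-class confusion matrix: verifiable claims predicted as NEI ("less confident") and NEI claims predicted as verifiable ("overconfident"). The sketch below shows how both counts can be read off; the function names and toy predictions are hypothetical and do not reproduce Figure 4.

```python
from collections import Counter

LABELS = ["Supported", "Refuted", "NEI"]

def confusion_matrix(gold, pred, labels=LABELS):
    """Rows are gold labels, columns are predicted labels."""
    counts = Counter(zip(gold, pred))
    return [[counts[(g, p)] for p in labels] for g in labels]

def nei_flow(matrix, labels=LABELS):
    """Return (verifiable claims predicted as NEI, NEI claims predicted as verifiable)."""
    k = labels.index("NEI")
    to_nei = sum(matrix[i][k] for i in range(len(labels)) if i != k)
    from_nei = sum(matrix[k][j] for j in range(len(labels)) if j != k)
    return to_nei, from_nei

# Toy predictions (hypothetical, not model outputs from the paper).
gold = ["Supported", "Refuted", "NEI", "NEI", "Supported"]
pred = ["NEI", "NEI", "NEI", "Supported", "Supported"]
matrix = confusion_matrix(gold, pred)
to_nei, from_nei = nei_flow(matrix)
```

A model with a large first count relative to the second behaves like the "less confident" pattern, and the reverse indicates overconfidence.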
PoT. Unexpectedly, incorporating a Python interpreter does not confer any advantage on our dataset (as shown in Table 4), despite its positive impacts on other numerical reasoning tasks. In order to understand this, we randomly selected 50 claims wherein the PoT incorrectly predicted the final veracity labels and evaluated the quality of the generated Python programs. We divide the errors into four categories, as assessed by human annotators: (i) Grounding errors, where the program incorrectly associates data with the respective cells in the table; (ii) Ambiguity errors, where the claim contains ambiguous expressions that the program fails to represent; (iii) Calculation errors, where incorrect floating point arithmetic calculation in Python leads to inaccurate results; and (iv) Program errors, which encompass mistakes such as incorrect or missing arguments/variables, and erroneous operations. We present the error analysis in Table 5, and examples of program errors can be found in Figure 13 and Figure 14 in Appendix G.

Table 5: The error types and their estimated proportions for incorrectly-predicted samples in PoT.
Compared to other datasets, categories (i) and (ii) present unique challenges in our dataset. Category (i) underlines the difficulty in accurately referencing the specific cells to which a claim refers. Category (ii), on the other hand, emphasizes the difficulties posed by the ambiguous nature of scientific claims, such as "A is significantly better than B", to program-based methods. This connection further emphasizes the contribution of our work in addressing the mismatches between reasoning types and the occurrence of grounding errors.
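Calculation errors of the kind listed as category (iii) can arise from exact comparison of floating point values. The sketch below is a generic illustration of that pitfall and of two common remedies; it is not taken from the programs analyzed in Table 5, and the helper `difference_matches` is hypothetical.

```python
import math

# Exact comparison of floats can miss an "obvious" equality.
naive_equal = (0.1 + 0.2 == 0.3)    # False under IEEE 754 double precision

# Comparing with a tolerance avoids this.
tolerant_equal = math.isclose(0.1 + 0.2, 0.3, rel_tol=1e-9)

def difference_matches(a, b, claimed, decimals=1):
    """Check a claimed difference at the precision used in the table."""
    return round(a - b, decimals) == round(claimed, decimals)

ok = difference_matches(57.5, 50.0, 7.5)
```

Rounding to the number of decimals printed in the table is a natural choice here, since the claims themselves are stated at that precision.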
To bridge this gap, we construct SCITAB, which contains complex claims from authentic scientific papers with table-based evidence.
(Yin et al., 2016; Yu et al., 2018) or employ graph neural networks to capture logical structure in statements, e.g., LogicFactChecker (Zhong et al., 2020) and ProgVGAT (Yang et al., 2020). However, these approaches often struggle with generalization, as they are tightly bound to specific table formats and language patterns. To address this, we have seen a shift toward table pre-training, with the advent of Table-BERT (Chen et al., 2020), TAPAS (Herzig et al., 2020), SaMoE (Zhou et al., 2022), PASTA (Gu et al., 2022), and DATER (Ye et al., 2023). These methods encode sentence-table pairs using language models and transform table-based reasoning into question-answering or natural language inference. In our work, we focus on evaluating pretraining-based methods on SCITAB because they not only demonstrate superior performance but also offer the benefits of few-shot learning.

Conclusion and Future Work
We present SCITAB, a novel dataset for scientific fact-checking that addresses the limitations of existing benchmarks. By incorporating real-world scientific claims and their corresponding evidence in the form of tables, SCITAB offers a more comprehensive and fine-grained representation of scientific reasoning. The challenging nature of SCITAB is evident from the performance of state-of-the-art models, highlighting the need for further research. For example, we believe that addressing the challenges posed by ambiguous claims represents a crucial direction for research in scientific fact-checking (Glockner et al., 2023; Liu et al., 2023a). One potential approach is to enhance the disambiguation of ambiguous claims by leveraging contextual information or external knowledge sources. Additionally, studying the compositionality in

Ethics Statement
We have received approval from the Institutional Review Board (IRB) for our data collection. The IRB reviewed our experimental design and research procedures to ensure that they do not pose more than minimal risks to research participants. We take steps to protect research participants' privacy and the confidentiality of their data. The review process took two months to complete.

Limitations
Firstly, the method and dataset are primarily designed for languages with limited morphology, such as English. Secondly, our SCITAB dataset is specifically focused on fact-checking scientific claims based on tables, which represents only one aspect of scientific fact-checking. Further research can explore the integration of other forms of evidence, including textual evidence and figure evidence, to enhance the fact-checking process. Thirdly, our SCITAB dataset is primarily focused on numerical reasoning types, as it is derived from the SciGen dataset, which also emphasizes numerical reasoning. It would be beneficial for future studies to incorporate a wider range of reasoning types to provide a more comprehensive fact-checking framework. Lastly, it would be valuable to explore additional annotation types, such as reasoning graphs, to further enrich the depth of analysis and capture more intricate relationships within the claims and evidence.

A Claim Extraction Procedure
A.1 Claim Definition
In academic writing (Lee et al., 2009), the accompanying text for data (presented as tables and figures) typically includes three fundamental elements, as outlined below. These elements encompass the definition of claims, which involve highlighting key data (KD) and commenting on key data (COM).

Location of results (LOC). Statements that locate where the figure/table is found, e.g., Figure 7 displays the mean percentile scores.
Highlighting of key data (KD). Statements that highlight the important data, e.g., (1) highest or lowest values; (2) overall trend or pattern in the data; (3) points that do not seem to fit the pattern or trend; (4) results which provide answers to your research questions.
Commenting on key data (COM). Statements that interpret the data. There are three types of comments: (1) generalization (deductions and implications drawn from the results), e.g., "This indicates that ..."; (2) comparison of results with those from prior studies, e.g., "Different from ..."; (3) explanation or speculation (possible reasons or cause-effect relationships for the results), e.g., "The possible reason is that ...".

A.2 Claim Extraction Interface
Figure 5 shows the user interface for the claim extraction task.
B Manual Claim Verification Procedure
Next, a demonstration is given on how to navigate and utilize the annotation interface effectively. Following this, a series of trial tests are released to the annotators. This is to verify their understanding and capability in the task. Last, we specify the deadline for completing annotations, outline how we check the quality of their work, brief them on a post-annotation survey, and explain the reimbursement procedure. A Q&A session is also incorporated to address any uncertainties or concerns. After receiving their reimbursement, the annotators signed an agreement sheet to confirm its receipt.

B.2 NEI Claim Verification Interface
Figure 6 shows the user interface for the NEI claim verification task.

B.3 Refuted Claim Verification Interface
Figure 7 shows the user interface for the refuted claim verification task.

B.4 Annotation Post-Survey
Figure 8 shows the examples of post-annotation survey questions and the answers of annotators.

C Analysis of Refuted Reasons in the Sci-Fact dataset
Table 6 provides an analysis of the reasons for refuted claims in the Sci-Fact dataset, along with their estimated proportions. A random sample of 100 refuted claims was selected, and the results indicate that 85% of claims were simply negated using terms like "not" or paraphrased based on the evidence sentences. Additionally, 6% of the refuted claims were attributed to incorrect calculation results, while 6% were identified as having wrong commonsense knowledge. A smaller proportion of refuted claims (3%) were found to have incorrect open-domain knowledge.

D Discussions on Human-Machine Collaboration
Our final data creation pipeline undergoes repetitive testing and revision until it reaches its current

Case D The values in the claim do not match. The value in the claim does not align with the corresponding value in the table. The correct value should be 27.9.
Case E The operation type is wrong. It applies the incorrect operation type. For instance, in the case of GCN+RC+LA (9), it is not accurate to claim that it is better than DCGCN1 because 22.9 > 22.0 and 53.0 > 52.6.

F Error Cases for InstructGPT
Error Type 1: Supported predicted as NEI. This error type indicates a discrepancy between the gold label, which is Supported, and the predicted label, which is NEI.
Error Type 2: Refuted predicted as NEI. This error type indicates a discrepancy between the gold label, which is Refuted, and the predicted label, which is NEI.
Error Type 3: Refuted predicted as Supported. This error type indicates a discrepancy between the gold label, which is Refuted, and the predicted label, which is Supported.
Error Type 4: NEI predicted as Supported. This error type indicates a discrepancy between the gold label, which is NEI, and the predicted label, which is Supported.
Error Type 5: NEI predicted as Refuted. This error type indicates a discrepancy between the gold label, which is NEI, and the predicted label, which is Refuted.
Error Type 6: Supported predicted as Refuted. This error type indicates a discrepancy between the gold label, which is Supported, and the predicted label, which is Refuted.
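The six error types are exactly the off-diagonal cells of a three-class confusion matrix over (gold label, predicted label). The following is a minimal sketch of that bookkeeping; the function names and the example predictions are hypothetical and serve only as illustration, and they are not taken from the released code or from the reported results.

```python
from collections import Counter

# Error-type numbering follows the list above: (gold label, predicted label).
ERROR_TYPES = {
    ("Supported", "NEI"): 1,
    ("Refuted", "NEI"): 2,
    ("Refuted", "Supported"): 3,
    ("NEI", "Supported"): 4,
    ("NEI", "Refuted"): 5,
    ("Supported", "Refuted"): 6,
}


def error_type(gold, predicted):
    """Return the error type number, or None when the prediction is correct."""
    if gold == predicted:
        return None
    return ERROR_TYPES[(gold, predicted)]


def count_error_types(pairs):
    """Count how often each error type occurs in (gold, predicted) pairs."""
    counts = Counter()
    for gold, predicted in pairs:
        kind = error_type(gold, predicted)
        if kind is not None:
            counts[kind] += 1
    return counts


# Hypothetical predictions, for illustration only (not results from the paper).
example = [
    ("Supported", "NEI"),
    ("Refuted", "Supported"),
    ("NEI", "NEI"),
    ("Supported", "NEI"),
]
print(count_error_types(example))
```

Keying the table on the (gold, predicted) pair keeps the numbering in one place, so a tally per error type can be read off directly when inspecting model outputs.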

G Error Cases for Program-of-Thoughts
Figure 13 and Figure 14 show five error examples of Program-of-Thoughts when applied to our SCITAB dataset. Below, we provide explanations for each of the error cases.
Error Case 2. It exhibits incomplete entity linking (Grounding error). The program should also parse other baseline results, such as "SFEGAN_WER = 14.9".
Error Case 3. It fails to generate a correct program (Program error). The variables and logical functions in the program are incorrect. For instance, "G2S_GAT_BLEU_LDC2015E86" should be "G2S_GIN_BLEU_LDC2015E86". The logical function "and" should be replaced with "or".
Figure 7: The user interface for the refuted claim verification task.
Error Case 4. It fails to generate a precise program for the approximation word "comparable" (Ambiguity error).Currently, the program defines "comparable" as "larger than", which is not accurate enough.
Error Case 5. It generates the correct program, but the calculation result is inaccurate because of floating-point imprecision in the Python code (Calculation error). For instance, Python may output '1.9499999', which is not equal to '1.95'.
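The calculation error in Error Case 5 arises when floating-point results are compared exactly. The Program-of-Thoughts prompt asks the model to call a function equal(a, b), but its implementation is not shown here; the sketch below is therefore only an assumption about how such a helper could tolerate small numerical differences, using an absolute tolerance chosen for illustration.

```python
import math


def equal(a, b, abs_tol=1e-6):
    """Tolerant equality check (hypothetical implementation, not the paper's).

    abs_tol is an assumed tolerance; it should be smaller than the
    precision at which table values are reported.
    """
    return math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)


# Exact comparison fails on a classic floating-point example.
print(0.1 + 0.2 == 0.3)            # False
# The tolerant comparison accepts it.
print(equal(0.1 + 0.2, 0.3))       # True

# The value from Error Case 5 and the subtraction from the ablation example.
print(equal(1.9499999, 1.95))      # True
print(equal(88.64 - 86.69, 1.95))  # True
# Values that differ at the reported precision are still distinguished.
print(equal(1.95, 1.96))           # False
```

An absolute tolerance is used rather than a relative one because table entries are reported to a fixed number of decimal places; rounding both operands to that precision before comparing would be an alternative design.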

Annotation Post Survey
Annotator 1:
• Is the task demonstration clear to you? Yes, clear.
• What do you think is the difficulty of this task? (1-10 points, 10 points is the most difficult) 5-6.
• Which part is the most difficult for you? Why? Judgment, understanding the way of original author think.
• Do you think the annotation batch is appropriate? What is the maximum batch amount for you in a week? Yes. 2 batches in a week during the examination. 4 during vacation.
• Could you provide some advice on how to improve the annotation platform? Looping for multiple operations.

Annotator 2:
• Is the task demonstration clear to you? Yes.
• What do you think is the difficulty of this task? (1-10 points, 10 points is the most difficult) 6
• Which part is the most difficult for you? Why? Table understanding; different parameters in the attributes.
• Do you think the annotation batch is appropriate? What is the maximum batch amount for you in a week? Ok. 2-3 batches
• Would you like to attend this session again as a 2-week participation? ok.
• Could you provide some advice on how to improve the annotation platform? I preferred to write down the annotations on the platform.

Annotator 3:
• Is the task demonstration clear to you? Yes, clear. The difficulty is different between demo and real annotation.
• What do you think is the difficulty of this task? (1-10 points, 10 points is the most difficult) 7
• Which part is the most difficult for you? Why? Table understanding-vocabulary.
• Do you think the sample amount is appropriate? What is the maximum batch amount for you in a week (1 batch contains 20 samples)? 10-15 samples for an hour. 50 samples a week.
• Would you like to attend this session again as a 2-week participation? Maybe not. But 15 samples offline for a week is ok.
• Could you provide some advice on how to improve the annotation platform? I think the current platform is fine for me.

For ``-Hierachical-Attn'', we remove the residual connections between the attention modules in the CMR decoders and all the attention memory access are based on the output from the LSTM. For ``-MLP'', we further replace the MLP with a single linear layer with the non-linear activation.
Claim: [CONTINUE] The effectiveness of our hierarchical attention design is proved by an accuracy drop of 1.95% after removing residual connections and the hierarchical stack of our attention modules.
Question: Is the above claim true or false? Let us think step by step.

Reasoning:
To verify this claim, we need to compare the performance of COMER with and without hierarchical attention design.
The performance of COMER without hierarchical attention design is 86.69% (table[3,2]). Therefore, the performance of COMER with hierarchical attention design is higher than the performance of COMER without hierarchical attention design by 88.64 - 86.69 = 1.95. Therefore, the Claim is true.

Answer:
Therefore, the claim is true.
The performance of CMOW-C on BigramShift is 66.0 (table[1,2]). Therefore, the performance of CMOW-C is higher than the performance of CMOW-R on WordContent by 66.8 - 72.9 = -6.1. The performance of CMOW-C is lower than the performance of CMOW-R on BigramShift by 66.0 - 70.8 = -4.8. Therefore, the Claim is false.

Answer:
Therefore, the claim is false.

H.4 Program-of-Thoughts Prompts
Read the following table and then write Python code to answer a question: (please call the function equal(a, b) to check whether a and b are equal)
Caption: Table 4: The ablation study on the WoZ2.0 dataset with the joint goal accuracy on the test set. For ``-Hierachical-Attn'', we remove the residual connections between the attention modules in the CMR decoders and all the attention memory access are based on the output from the LSTM. For ``-MLP'', we further replace the MLP with a single linear layer with the non-linear activation.

Figure 2: The human-model collaboration construction process of SCITAB, which contains three steps: 1) data preparation (including data preprocessing and claim extraction) 2) automatic claim generation (including refuted and Not Enough Info claim generation) and 3) manual claim verification.

Figure 3: The distribution histogram of reasoning steps in our SCITAB dataset. The x-axis is the reasoning steps in each claim, and the y-axis is the frequency for each reasoning step. The shallow claims (with 1-2 reasoning steps) are highlighted in red, while the deep claims (with 3+ reasoning steps) are highlighted in blue.
Open Source LLMs. We also evaluate the performance of state-of-the-art open-source LLMs, including 1) LLaMA (Touvron et al., 2023), the first open-source model by Meta AI; 2) Alpaca (Taori et al., 2023), an instruction-following language model fine-tuned on LLaMA; and 3) Vicuna (Chiang et al., 2023), arguably the best-performing open-source LLM, which is claimed to achieve 90% of the quality of OpenAI ChatGPT. We use the same input format as in the encoder-decoder model.
Closed Source LLMs. These are closed-source LLMs that require API calls for inference, including InstructGPT (text-davinci-003) (Ouyang

Figure 11 and Figure 12 show six error examples of InstructGPT in the zero-shot setting when applied to our SCITAB dataset.

Figure 5: The user interface for the claim extraction task.

Figure 6: The user interface for the NEI claim verification task.

Figure 8: The examples of post-annotation survey questions and the answers of annotators.
Is the above claim true or false? Let us think step by step.
Our single model achieves 27.6 BLEU points, which is the new state-of-the-art result for single models.

Figure 10: The refuted claim cases D and E. In Case D, the values in the claim do not match. In Case E, the operation type is wrong.

Table 2: Applicability (App.), Productivity (Prod.) and Coverage (Cov.) of the various words in the alternatives of the COPA dev set.

Table 1: Comparison of SCITAB to three recent table fact verification datasets: TabFact
Closed-domain knowledge: Extract information from context sentences in the table caption or article.

T, a table fact-checking model F predicts a label Y to verify whether C is supported, refuted, or cannot be verified by the
Table 3: Refuted and NEI reasons and their estimated proportions (Prop.) in SCITAB.

Table 4: Macro-F1 of baselines on SCITAB for different settings. The # of Para. indicates the number of parameters in the models. The TAPAS and TAPEX models are fine-tuned on the TabFact dataset, while others perform zero-shot learning. The bold text indicates the best performance among I to III, while the underlined text indicates the overall best performance among all the models.

Table-based Reasoning. Table-based reasoning requires reasoning over both free-form natural language queries and (semi-)structured tables. Early works either rely on executable languages (e.g., SQL and SPARQL) to access the tabular data

Case C The claim is partially right.The claim is generally correct, with the exception of the BShift column which does not fulfill the claim.