Leveraging Data Recasting to Enhance Tabular Reasoning

Creating challenging tabular inference data is essential for learning complex reasoning. Prior work has mostly relied on two data generation strategies. The first is human annotation, which yields linguistically diverse data but is difficult to scale. The second is synthetic generation, which is scalable and cost-effective but lacks inventiveness. In this research, we present a framework for semi-automatically recasting existing tabular data to make use of the benefits of both approaches. We utilize our framework to build tabular NLI instances from five datasets that were initially intended for tasks like table2text creation, tabular Q/A, and semantic parsing. We demonstrate that recasted data can be used both as evaluation benchmarks and as augmentation data to enhance performance on tabular NLI tasks. Furthermore, we investigate the effectiveness of models trained on recasted data in the zero-shot scenario, and analyse trends in performance across different recasted dataset types.


Introduction
Given a premise, Natural Language Inference (NLI) is the task of classifying a hypothesis as entailed (true), refuted (false), or neutral (cannot be determined from the given premise). Several large-scale datasets such as SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2018), and SQuAD (Rajpurkar et al., 2016) explore NLI with unstructured text as the premise.
While textual inference is commonly researched, structured data (e.g. tables, knowledge graphs, and databases) enables more complex types of reasoning, such as ranking, counting, and aggregation. Creating challenging large-scale supervision data is vital for research in tabular reasoning. In recent years, many initiatives to include tabular semi-structured data as the premise have been introduced, e.g. tabular inference (TNLI) datasets such as TabFact (Chen et al., 2020b) and InfoTabS (Gupta et al., 2020), and shared tasks like SemEval 2021 Task 9 (Wang et al., 2021b) and FEVEROUS (Aly et al., 2021). Tabular data differs from unstructured text in that its underlying structure captures information and relationships in a succinct manner (Gupta et al., 2020).
Despite being fluent, diverse, and creative, these human-annotated datasets are limited in scale owing to the costly and time-consuming nature of annotation. Furthermore, Gururangan et al. (2018) and Geva et al. (2019) show that many human-annotated datasets for NLI contain annotation biases or artifacts. This allows NLI models to learn spurious patterns (Niven and Kao, 2019), which enables models to predict the right label for the wrong reasons, sometimes even with noisy, incorrect, or incomplete input (Poliak et al., 2018b). Recently, Gupta et al. (2021) revealed that tabular inference datasets also suffer from comparable challenges. Furthermore, Geva et al. (2019) and Parmar et al. (2022) show that annotators introduce their own biases during annotation. For example, Gupta et al. (2021) demonstrate that annotators only generate hypothesis sentences from keys having numerical values, implying that some keys are either over- or under-utilized.
On the other hand, automatic grammar-based strategies, despite their scalability, lack linguistic diversity (both structural and lexical) and the necessary inference complexity. These approaches create examples with only naive reasoning. Recently, Geva et al. (2020) and Eisenschlos et al. (2020) used context-free grammars and templates to generate augmentation data for tabular NLI. Additionally, the use of large language generation models (e.g., Zhao et al., 2022; Lewis et al., 2020; Raffel et al., 2020) for data generation has been proposed as well (e.g., Zhao et al., 2022; Ouyang et al., 2022; Mishra et al., 2022). However, such generation systems lack factuality, leading to hallucinations, inadequate fact coverage, and token repetition problems.
Can we generate challenging supervision data that is as scalable as synthetic data and yet contains human-like fluency and linguistic diversity? In this work, we attempt to answer the above question through the lens of data recasting. Data recasting refers to transforming data intended for one task into data intended for another, distinct task. Although data recasting has been around for a long time (for example, QA2D (Demszky et al., 2018) and SciTail (Khot et al., 2018) effectively recast question answering data for inference), no earlier study has applied it to semi-structured data.
Therefore, we propose a semi-automatic framework for tabular data recasting. Using our framework, we generate large-scale tabular NLI data by recasting existing datasets intended for non-NLI tasks such as Table2Text generation (T2TG), Tabular Question Answering (TQA), and Semantic Parsing on Tables (SPT). This recasting strategy is a middle-road technique that allows us to benefit from both synthetic and human-annotated data generation approaches. It allows us to minimise annotation time and expense while maintaining linguistic variance and creativity via the human involvement in the original source dataset. Table 1 shows an example of tabular data recasting. Note that while the recasting steps we describe in this work are automatic, we call the overall end-to-end framework semi-automatic to account for the manual effort that went into creating the source datasets.
Our recasted data can be used both for evaluation and for augmentation in tabular inference tasks. Models pre-trained on our data show an improvement of 17% over the TabFact baseline (Chen et al., 2020b) and 1.1% over Eisenschlos et al. (2020), a synthetic data augmentation baseline. Additionally, we report a zero-shot accuracy of 71.1% on the TabFact validation set, which is 5% higher than the supervised baseline accuracy reported by Chen et al. (2020b).
Our main contributions are the following: 1. We propose a semi-automatic framework to generate tabular NLI data from other non-NLI tasks such as T2TG, TQA, and SPT.
2. We build five large-scale, diverse, human-like, and complex tabular NLI datasets sourced from the datasets shown in Table 4.
3. We present the usage of recasted data as TNLI model evaluation benchmarks. We demonstrate an improvement in zero-shot transfer performance on TabFact using recasted data.
4. We demonstrate the efficacy of our generated data for data augmentation on TabFact.
The dataset and associated scripts are available at https://recasting-to-nli.github.io

Why Table Recasting?
Tables are structured data forms. Table cells define a clear boundary for a standalone, independent piece of information. These defined table entries facilitate the task of drawing alignments between relevant table cells and given hypotheses. If both the premise and hypothesis were plain text, any n-gram in the premise could be aligned to any n-gram in the hypothesis. Finding alignments between the premise and hypothesis is crucial in the recasting process. This is because, in order to modify statements, we must understand which entities influence their truth value.
Moreover, in tables, entries of the same type (same part-of-speech tag, named entity type, domain, etc.) are grouped under a common column header. This allows us to easily identify a group of candidates which are interchangeable in a sentence without disrupting its coherence. Frequently, the column header also indicates the data type (e.g. Name, Organization, Year). This is incredibly beneficial when modifying source data by substituting entities.

Tabular Recasting Framework
In this section, we describe a general semi-automatic framework to recast tabular data for the task of Table NLI. By recasting, we mean converting data meant for one task into a format that satisfies the requirements of a different task. In addition to the table, we require at least one reference statement that is valid for the table. We utilize the structure of this reference statement (henceforth referred to as the Base Entailment) to generate further entailments and contradictions. Once we have the Base Entailment, contradictions can be formed fairly easily: falsifying any part of the Base Entailment that is linked to the table creates a contradiction.
To create an entailment, however, every portion of the perturbed statement must hold true. This means that all entities originating from the table (henceforth referred to as relevant entities) must be found in the Base Entailment. Only then can we know with certainty how perturbations affect the truth value of a given assertion.
Alignments between a table and a Base Entailment are not always apparent, as demonstrated in Table 2. In the example "Party A won the most seats", the alignment between "most" and the greatest number of seats must be determined. Although we can employ automatic matching techniques between the Base Entailment and the table to extract relevant entities, we cannot be certain of detecting all of them unless they are explicitly provided. Therefore, we must be able to extract the following from source datasets: (a) a table, i.e. the Premise; (b) a reference statement, i.e. the Base Entailment; (c) relevant entities; and (d) their alignments with the reference statement.
Once the prerequisites are met, new NLI instances can be formed by perturbing existing data in two ways: (a) by perturbing the hypothesis and (b) by perturbing the table, i.e. the premise.

Perturbing the Hypothesis
We modify the hypothesis, i.e. the Base Entailment, by substituting relevant entities with other potential candidates. We presume the tables are vertically aligned, which means that the top row contains headers and each column contains entities of the same kind. A potential candidate for a relevant entity coming from table cell C_XY, at [row X, column Y], can be any other non-null entity from the same column Y.

Creating Entailments (E)
To create entailments, we replace all the relevant entities in the given Base Entailment with potential candidates. Two or more relevant entities coming from table cells in the same row, say C_XA and C_XB, must be substituted with potential candidates from columns A and B respectively, such that the candidates' row coordinates are equal (see Table 2). Entities originating from "aggregate rows" (such as the Total row in Table 2) or "headers" must be left intact.
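As a concrete illustration, the same-row substitution rule can be sketched as follows. This is a toy example, not the authors' released code; the table contents, row indices, and alignment format are our own assumptions:

```python
# Toy sketch of entailment creation by same-row substitution.
# Rows are dicts; `alignment` lists (surface_text, row, column) triples
# marking where each relevant entity of the Base Entailment comes from.
TABLE = [
    {"Party": "Party A", "Seats": "120"},
    {"Party": "Party B", "Seats": "89"},
    {"Party": "Party C", "Seats": "89"},
    {"Party": "Total",   "Seats": "298"},  # aggregate row: left intact
]
AGGREGATE_ROWS = {3}

def create_entailments(base, alignment):
    """Replace ALL relevant entities with cells drawn from one other row."""
    source_rows = {row for _, row, _ in alignment}
    assert len(source_rows) == 1, "relevant entities must share a row"
    entailments = []
    for z, row in enumerate(TABLE):
        if z in source_rows or z in AGGREGATE_ROWS:
            continue  # skip the source row and aggregate rows
        new = base
        for surface, _, col in alignment:
            new = new.replace(surface, row[col])
        entailments.append(new)
    return entailments

ents = create_entailments(
    "Party B won 89 seats",
    [("Party B", 1, "Party"), ("89", 1, "Seats")],
)
# one new entailment per eligible row of the table
```

Because all entities are swapped together from a single row, each output statement remains true for the table.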

Creating Contradictions (C)
To create contradictions, we substitute one or more relevant entities from the Base Entailment with alternative candidates. We note that the ensuing statement may be an entailment by accident. In Table 2, consider the Base Entailment "Party B won 89 seats". Suppose we replace one key entity (Party B) with a potential candidate to arrive at "Party C won 89 seats". The resultant statement remains an entailment. To prevent this from occurring, the non-replaced entities are compared. Assume C_XA and C_XB represent the relevant entities in the Base Entailment. If we replace C_XA → C_ZA, then we must guarantee that C_XB ≠ C_ZB to avoid unintentional entailments.
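The guard against accidental entailments can be sketched like this (again a toy illustration under our own table and alignment assumptions, not the paper's code):

```python
# Toy sketch of contradiction creation with the C_XB != C_ZB guard.
TABLE = [
    {"Party": "Party A", "Seats": "120"},
    {"Party": "Party B", "Seats": "89"},
    {"Party": "Party C", "Seats": "89"},
    {"Party": "Total",   "Seats": "298"},
]
AGGREGATE_ROWS = {3}

def create_contradictions(base, alignment):
    """Swap ONE relevant entity; keep only swaps that the guard allows."""
    contradictions = []
    for i, (surface, x, col) in enumerate(alignment):
        others = [a for j, a in enumerate(alignment) if j != i]
        for z, row in enumerate(TABLE):
            if z == x or z in AGGREGATE_ROWS or row[col] == surface:
                continue
            # Guard: every NON-replaced entity must differ between rows
            # x and z, otherwise the new statement may stay true.
            if any(TABLE[x][c] == TABLE[z][c] for _, _, c in others):
                continue
            contradictions.append(base.replace(surface, row[col]))
    return contradictions

contras = create_contradictions(
    "Party B won 89 seats",
    [("Party B", 1, "Party"), ("89", 1, "Seats")],
)
```

On this table, the accidental entailment "Party C won 89 seats" is filtered out because Party B and Party C hold the same number of seats.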
We also generate contradictions by substituting antonyms for words in the Base Entailment. This is particularly helpful for scenarios involving superlatives and comparatives (refer Table 2). We use NLTK WordNet to find antonyms, and enforce equality of POS tags for word-antonym pairs. We create a dictionary of such pairs across datasets, count their frequency, and broadly filter (manually) the word-antonym pairs, keeping the frequent and sensible ones. The majority of these are comparative/superlative word pairs (higher-lower, most-least, best-worst).
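Once the dictionary is fixed, the substitution step is a simple token flip. The pairs below are an illustrative stand-in for the WordNet-mined, hand-filtered dictionary:

```python
# Illustrative stand-in for the manually filtered antonym dictionary
# (the paper mines candidates with NLTK WordNet, then filters by hand).
ANTONYMS = {
    "most": "least", "least": "most",
    "highest": "lowest", "lowest": "highest",
    "best": "worst", "worst": "best",
}

def antonym_contradictions(base):
    """Flip each antonym-bearing token once to create a contradiction."""
    out = []
    tokens = base.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in ANTONYMS:
            flipped = tokens[:i] + [ANTONYMS[tok.lower()]] + tokens[i + 1:]
            out.append(" ".join(flipped))
    return out
```

A statement with no antonym-bearing token simply yields no contradiction from this rule.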

Perturbing the Table (Premise)
In this subsection, instead of modifying the Base Entailment, we swap two or more table cells to modify the premise, i.e. the table. Similar to Kaushik et al. (2020) and Gardner et al. (2020), we build example pairs with minimal differences but opposing inference labels in order to improve model generalisation. These modified tables no longer reflect the actual world; hence, we refer to them as counterfactual. The addition of counterfactual data increases the model's robustness by preventing it from learning spurious correlations between the label and the hypothesis/premise. Minimally varying counterfactual data also ensures that the model is not biased and preferably grounds on primary evidence, as opposed to depending blindly on its pre-trained knowledge. Similar findings were made by Müller et al. (2021) for TabFact.

Creating Counterfactual Tables (CF)
We consider a contradiction C1 formed by replacing the relevant cell C_XA → C_ZA in the original table (as described in Table 2). To create a counterfactual table, we swap cells C_XA ↔ C_ZA such that C1 becomes an entailment to the modified table, and the original Base Entailment becomes a contradiction to it. Based on this, we generate further hypotheses, as illustrated in Table 2.
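The swap itself is mechanical; a sketch in our own toy table representation (not the authors' implementation):

```python
def make_counterfactual(table, col, x, z):
    """Swap cells (x, col) <-> (z, col) without mutating the original.

    After the swap, a contradiction built from row z's value becomes an
    entailment of the counterfactual table, and the original Base
    Entailment becomes a contradiction of it.
    """
    cf = [dict(row) for row in table]  # rows are flat dicts, so this copies
    cf[x][col], cf[z][col] = cf[z][col], cf[x][col]
    return cf

original = [
    {"Party": "Party A", "Seats": "120"},
    {"Party": "Party B", "Seats": "89"},
]
cf = make_counterfactual(original, "Seats", 1, 0)
# "Party B won 120 seats" is entailed by `cf` but contradicted by
# `original`; "Party B won 89 seats" flips the other way.
```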
Hypothesis Paraphrasing (HP)

Dagan et al. (2013) demonstrate that data paraphrasing increases lexical and structural diversity, thus boosting model performance on unstructured NLI. In accordance with Dagan et al. (2013), we paraphrase our data because the hypotheses derived from Base Entailments have similar structures. For producing paraphrases, we employ the publicly available T5 model (Raffel et al., 2020) trained on the Google PAWS dataset (Zhang et al., 2019). We produce the top five paraphrases and then select at random from among them.

Addressing Tabular Recasting Constraints
Dataset-specific implementations pose several challenges. We address them in the following ways:

Table Orientation. As stated in Section 3, we conduct all experiments assuming the tables are vertically aligned. We observe several horizontally aligned tables (with the first column containing headers) in source datasets. As a preliminary processing step, we employ heuristics to automatically recognise such tables and flip them. For example, we check for frequently occurring header names in the first column, or for consistency in data types (numeric, alpha, etc.) across rows rather than columns.
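The type-consistency heuristic can be sketched as follows, assuming cells are strings and a table is a list of rows (our own simplification of the paper's heuristics):

```python
def cell_type(value):
    """Crude numeric/text classifier for a cell."""
    return "num" if value.lstrip("-").replace(".", "", 1).isdigit() else "text"

def type_consistency(lines):
    """Fraction of lines whose cells (past the first, a possible header)
    all share one data type."""
    ok = sum(len({cell_type(c) for c in line[1:]}) == 1 for line in lines)
    return ok / len(lines)

def is_horizontal(grid):
    """True if types are more consistent along rows than along columns,
    i.e. the headers probably sit in the first column."""
    return type_consistency(grid) > type_consistency(list(zip(*grid)))

def flip(grid):
    """Transpose a horizontally aligned table into vertical form."""
    return [list(row) for row in zip(*grid)]
```

For instance, [["Name","Alice","Bob"], ["Age","34","29"], ["City","Oslo","Lima"]] is detected as horizontal (each row past its header cell has one type, while the columns mix numbers and text) and transposed.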
Partial Matching. We observe that some datasets provide relevant cells, but do not provide their alignments with the Base Entailment. We attempt to match every relevant cell with n-grams in the Base Entailment. Of particular interest is the sample row shown in Table 3, which contains names, numbers, locations and dates that are not exact, but partial, matches to n-grams in the Base Entailment.

Irreplaceable Entities. We observe that not all relevant entities are replaceable by potential candidates. Table 2 presents an example of a table with a Total row. The relevant entity 298 cannot be replaced while creating New Entailment OG, because it is an aggregate entity whose substitution would disrupt the truth value of the statement.
A similar observation is made while swapping table cells to create counterfactual tables. Suppose we swap the aggregate cell 298 with 120. The resultant table would be logically flawed, since the Seats column would no longer add up to its Total. To prevent this, aggregate rows and header cells are marked as non-replaceable entities.
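The partial-matching step described above can be approximated with stdlib fuzzy matching. The paper does not specify its matcher; the approach and threshold below are our own assumptions:

```python
from difflib import SequenceMatcher

def best_ngram_match(cell, sentence, threshold=0.7):
    """Align a table cell to the closest n-gram of the Base Entailment,
    tolerating near-string partial matches (exact matches score 1.0)."""
    tokens = sentence.split()
    want = len(cell.split())
    best, best_score = None, threshold
    for n in (want - 1, want, want + 1):  # allow slightly shorter/longer grams
        if n < 1:
            continue
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            score = SequenceMatcher(None, cell.lower(), gram.lower()).ratio()
            if score > best_score:
                best, best_score = gram, score
    return best
```

A cell like "United States Capitol" is then aligned to a near-variant in the sentence even when the surface strings differ slightly; cells with no sufficiently close n-gram return no alignment.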

Dataset Recasting
Using the framework outlined in Section 3, we recast the five datasets listed in Table 4. All datasets utilise open-domain Wikipedia tables, comparable to TabFact. In addition, these datasets and TabFact share reasoning types such as counting, minimum/maximum, ranking, superlatives, comparatives, and uniqueness, among others. Table 5 summarises the statistics of the recasted datasets.

Table2Text Generation to Table NLI
Given a table and a set of highlighted cells, the Table2Text generation task is to create a description derived from the highlighted cells. We presume this description to be the Base Entailment, given that it is true based on the table. The highlighted cells become the relevant entities. An example is shown in Table 2, where Base Entailment OG is a description generated from the OG Table's highlighted cells.

Source Dataset: Task
WikiTableQuestions (Pasupat and Liang, 2015a): TQA
FeTaQA (Nan et al., 2022): TQA
Squall (Shi et al., 2020b): SPT
WikiSQL (Zhong et al., 2017): SPT
ToTTo (Parikh et al., 2020): T2TG

Recasting WikiTableQuestions (Pasupat and Liang, 2015a). WikiTableQuestions provides 22k questions over Wikipedia tables, with short-form answers. We use a T5-based pre-trained model developed by Chen et al. (2021a) to convert {Question, Answer} → Statement (refer Table 1). We presume this statement to be our Base Entailment. Unless it is an aggregate value, the short-form answer is likely to be an entity from the table. We search for matches between the answer and table cells, as well as n-grams in the question. We create contradictions from any relevant entities found.
Recasting FeTaQA (Nan et al., 2022). FeTaQA provides 10k question-answer pairs on Wikipedia tables, with long-form answers. These long-form answers are Base Entailments in themselves. Since supporting cell information is provided as well, we create both entailments and contradictions wherever possible.

Example of FeTaQA Recasting
Q: What was the total number of seats?
Long answer: There were 298 seats in total.
Perturbation: replace 298 with 89
Entailment: There were 298 seats in total.
Contradiction: There were 89 seats in total.

Semantic Parsing to Table NLI
Given a table and a question, the Semantic Parsing task is to generate the underlying logical/SQL form of the question. Since the datasets provide a logical query form, we execute this query to obtain a short-form answer. We combine the question and short-form answer as mentioned in Section 5.2 to get the Base Entailment. SQL queries are parsable, allowing for easy identification of column names and cell values. Since the reasoning depends on these entities, we infer that these are the relevant entities.
Recasting WikiSQL (Zhong et al., 2017). WikiSQL provides 67.7k annotated [SQL query, textual question] pairs on Wikipedia tables. To augment this data, we replace values in an SQL query and its corresponding question in parallel. We execute the new query, and combine the answer with the perturbed question to create a new entailment. Note that when executing a query, the answer can be a single entity or a list of multiple entities. If a list of entities satisfies the query, any of these entities can be used to create entailments, while none of them should be used to create contradictions (we instead find other potential candidates from the answer column). Consider the OG table given in Table 2.
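The parallel query-and-question perturbation can be sketched with an in-memory SQLite table. This is a toy illustration; WikiSQL's own execution setup differs, and the table below is invented:

```python
import sqlite3

# Toy table loosely mirroring Table 2.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (party TEXT, seats INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [("Party A", 120), ("Party B", 89), ("Party C", 89)])

def perturb_and_execute(sql, question, old, new):
    """Substitute `old` -> `new` in query and question IN PARALLEL,
    then re-execute to obtain the answers for the perturbed question."""
    sql2 = sql.replace(old, new)
    q2 = question.replace(old, new)
    answers = [row[0] for row in con.execute(sql2)]
    # Any answer in `answers` yields an entailment when combined with q2;
    # none of them may be used to build a contradiction.
    return q2, answers

q2, answers = perturb_and_execute(
    "SELECT seats FROM t WHERE party = 'Party A'",
    "How many seats did Party A win?",
    "Party A", "Party B",
)
```

Combining the perturbed question with an executed answer ("Party B won 89 seats") gives a new entailment grounded in the table.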

Example of WikiSQL Recasting

Q: Which party has the maximum seats?
SQL: select party from T where seats = max(seats)
Skeleton Q: Which C1_text has the maximum C2_num?
Skeleton SQL: select C1_text from T where C2_num = max(C2_num)
Recasting Squall (Shi et al., 2020b). Squall provides 11k [SQL query, textual question] pairs with table metadata. We augment it similarly to WikiSQL. Furthermore, the table metadata enables us to identify column types and, in some circumstances, reduce SQL queries and questions to skeletons. These skeletons may subsequently be used to generate hypotheses on additional tables that meet the column type specifications of the skeleton in question. Consider the example from Table 2, with columns Party (text) and Seats (numeric). Its skeleton can now be applied to another table, say one about countries and their populations, to ask "Which country has the maximum population?".

Human Evaluation
We asked five annotators to annotate fifty samples from each dataset on two fronts:
• Inference label: Label each sample as entail, refute or neutral. Neutral samples can be either those which cannot be derived from the table, or those which do not make sense.
• Coherence score: Score each sample on a scale of 1 to 3 based on its semantic coherence and grammatical correctness, 1 being incoherent and 3 being coherent with minor or no grammatical issues. A score of 2 is given to statements whose meaning can be understood, but whose structure or grammar is incorrect in more than one place.
We compare our generated label with the majority annotated inference label; if no majority is reached, we consider the sample inconclusive. For the coherence score, we take the average across the five annotators.
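The evaluation reduces to a strict majority vote plus a mean; a small sketch with invented annotator responses:

```python
from collections import Counter
from statistics import mean

def evaluate_sample(labels, scores, generated):
    """Return (label_match, mean_coherence). A strict majority (more
    than half of the annotators) is required; ties are 'inconclusive'."""
    (top, count), = Counter(labels).most_common(1)
    majority = top if count > len(labels) / 2 else "inconclusive"
    return majority == generated, mean(scores)

match, coherence = evaluate_sample(
    ["entail", "entail", "entail", "refute", "neutral"],
    [3, 3, 2, 3, 2],
    "entail",
)
```

With five annotators, at least three must agree; a 2-2-1 split counts as inconclusive and never matches the generated label.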
Analysis. Results are summarized in Table 6. We observe high label match scores for our datasets, with QA-TNLI at 90%, Squall-TNLI at 87% and WikiSQL-TNLI at 84%. ToTTo-TNLI is slightly behind at 78%, largely due to samples marked as "neutral" or samples where no majority was reached. We also observe a consistently above-average coherence score, largely between 2.5 and 3. This implies that most of our data is logical, coherent, and grammatical. Since the sources of our data are human-written (Wikipedia text/human annotations), we expect our generated sentences to be fluent and semantically correct.

Experiments and Analysis
In this section, we examine the relevance of our recast data across various settings. Overall, we aim to answer the following research questions:
1. RQ1: How challenging is recasted data as a TNLI benchmark?
2. RQ2: How effective are models trained on recasted data in a zero-shot setting?
3. RQ3: How useful is recasted data for data augmentation on tabular NLI?

Experimental Setup
In all experiments, we follow a pre-training pipeline similar to Eisenschlos et al. (2020). We start with TAPAS (Herzig et al., 2020), a table-based BERT model, and intermediately pre-train it on our recasted data. We then fine-tune the model on the downstream tabular NLI task.
Dataset. We use TabFact (Chen et al., 2020b), a benchmark Table NLI dataset, as the end task to report results. TabFact is a binary classification task (with labels: Entail, Refute) on Wikipedia-derived tables. We use the standard train and test splits in our experiments, and report the official accuracy metric. TabFact gives simple and complex tags to each example in its test set, referring to statements derived from single and multiple rows respectively. Complex statements encompass a range of aggregation functions applied over multiple rows of table data. We report and analyze our results on simple and complex test data separately.

Results and Analysis
We describe the results of our experiments with respect to the research questions outlined above.

As Evaluation Benchmarks (RQ1)
We randomly sample small subsets from each dataset, including counterfactual tables, to create test sets. We evaluate the publicly available TAPAS-TNLI model (Eisenschlos et al., 2020), fine-tuned on TabFact, on the randomly sampled test sets, as shown in Table 7. We find that even though TabFact contains both simple and complex training data, the model achieves a best accuracy of 68.6%, more than 12 points behind its accuracy on the TabFact test set.
Analysis. The TAPAS-TNLI model performs best on WikiSQL-TNLI data, showing either that WikiSQL is most comparable to TabFact (in terms of domain, reasoning, and writing) or that WikiSQL is relatively trivial to address. Squall-TNLI is the hardest, as expected, as Squall was designed specifically to include questions that execute complex SQL logic. QA-TNLI and ToTTo-TNLI lie in between, showing that they have some similarities with TabFact, but also incorporate complementary reasoning instances.

Zero Shot Inference Performance (RQ2)
Once we pre-train our model on recasted TNLI data, it is in principle already a table NLI model. Since we create versatile and large-scale datasets, we look at the zero-shot accuracy of our models on the TabFact test set before fine-tuning, as shown in Table 8. Our best model gives 83.5% accuracy on the simple test set before fine-tuning, 6.0% ahead of Table-BERT, a supervised baseline. Our best model also outperforms TAPAS-Row-Col-Rank (Dong and Smith, 2021), a model trained on synthetic NLI data, by 7% in the zero-shot setting.
Analysis. QA-TNLI achieves the best zero-shot performance of 71.1%. We speculate that joining two datasets (FeTaQA and WikiTableQuestions) helps the model learn a variety of linguistic structures and reasoning. It is closely followed by Combined-TNLI, a model trained on the mixture of all the datasets; we speculate that its training may have been negatively impacted by integrating too many distinct data kinds. Squall-TNLI notably gives 62.6% accuracy on the complex test set, indicating its utility for learning complex reasoning. The zero-shot accuracy of TabFact-trained models on Squall-TNLI (Table 7) and that of the Squall-TNLI-trained model on TabFact (55.1% vs 69.1%) clearly show that Squall-TNLI is a superior dataset in terms of complexity of reasoning. ToTTo-TNLI performs fairly well on simple data (80.3%) but is not well equipped to handle complex examples. This is due to the "descriptive" nature of generation data, which includes limited inferential assertions.

Data Augmentation for TabFact (RQ3)
Since TabFact is a binary classification task with Entail and Refute labels, our recasted data can also be used for data augmentation. We pre-train the model with our recasted data, similar to Eisenschlos et al. (2020) (refer Section 6.1), before final fine-tuning on the TabFact dataset.

Analysis. Following the zero-shot results (Table 8), QA-TNLI performs well, as expected, in the fine-tuned setting. We speculate that ToTTo-TNLI outperforms QA-TNLI due to their disparity in dataset size (nearly 8x more data, refer Table 5). The fact that WikiSQL-TNLI achieved the highest accuracy with TabFact-trained models (Table 7) and the lowest zero-shot accuracy (Table 8) on TabFact indicates that the data is relatively non-complex. Squall-TNLI does not improve model performance after augmentation, despite its remarkable zero-shot performance (Table 8). We suspect that this is because the domains and types of underlying logic (i.e. reasoning types) are quite distinct. We also combine all datasets (in equal proportions) to train a composite TNLI model. Its accuracies are not on par with our best model. There can be several reasons for this, one being that our mixing strategy is not optimal. We could, for example, train on one dataset at a time and then slowly move on to the next, instead of mixing all datasets in equal proportions at each stage. This can be further investigated in the future. Another possibility is that the datasets include distinct types of data, such that merging them all has a detrimental effect.

Conclusion
In this paper, we introduced a semi-automatic framework for recasting tabular data. We made our case for the recasting route due to its cost effectiveness, scalability, and ability to retain human-like diversity in the resultant data. Finally, we leveraged our framework to generate NLI data from five existing tabular datasets. In addition, we demonstrated that our recasted datasets can be utilized both as evaluation benchmarks and for data augmentation to enhance performance on the tabular NLI task on TabFact (Chen et al., 2020b).

Limitations
While our work on tabular data recasting produces intriguing outcomes, we observe the following limitations in our approach:
1. Source datasets are designed for tasks different from the target. While our methodology ensures that recasted data retains the strengths and positive qualities of its original source, we observe that some of these traits may not necessarily coincide with the target task. For instance, generation tasks provide "descriptions"; the annotated data is therefore descriptive in nature, and unlikely to contain complicated reasoning involving common sense and table-specific knowledge. In addition, any faults in the original data (e.g. bias issues) may be transferred to the recasted version.
2. Although the domains of source and target tasks can be comparable (in our example, open-domain Wikipedia tables), their distributions of categories, themes, and so on are likely to vary.When we train models using recasted augmentation data, we unintentionally introduce a domain transfer challenge.As a result, the final model's performance is influenced to some extent by domain alignment.
3. Tables are semi-structured data representations that differ not just in domain and writing style, but also in structure. For example, InfoTabS (Neeraja et al., 2021) is a collection of Infoboxes, which are tables that describe a single entity (person, organisation, location). These are very different from the database-style tables that we use in our research. Tables can also be chronological, nested, or segmented, which makes them more challenging.
While we can employ our current heuristics to identify such tables, our current recasting strategy is prone to failure with tables that do not have database-like structures.
4. Annotated data sometimes relies on common sense and implicit knowledge that is not explicitly mentioned in the premise. Such data instances might be difficult to interpret automatically, making them challenging to recast. For example, in Table 10, to compare "Gold" with "Silver", the association of a silver medal with 2nd place and a gold medal with 1st place must be known. This implicit, common-sense-like knowledge makes the example hard to recast.
5. Our work on data recasting is done only on English-language data. However, our proposed framework is easily extensible to other languages, high-resource and low-resource alike. Since we depend on identifying and aligning entities (between premise and hypothesis), morphologically analytic languages are easier to work with. Highly agglutinative languages may require additional effort, such as morphological analysis.

Table 2 :
Pipeline for generating recasted NLI data. We first create entailments and contradictions from the given base annotation. We then create a counterfactual table, taking a contradiction to be the new base annotation. Subscript OG represents the "Original" table and subscript CF represents the "Counterfactual" table. Note that Base Entailment OG contradicts Table CF and Base Entailment CF contradicts Table OG. This pair will always exhibit this property, but there can be statements which entail (or contradict) both the OG and CF tables.
Obama's inauguration as the forty fourth president took place at the United States Capitol in 2009.

Table 3 :
An example of cases requiring partial matching.

Table 4 :
Source datasets used for creating tabular NLI data

Table 5 :
Statistics of recasted datasets. QA-TNLI combines recasted data from both FeTaQA and WikiTableQuestions. Test splits are created by randomly sampling 10% of the samples from each dataset.

Recasting ToTTo (Parikh et al., 2020). ToTTo comprises over 120k training examples derived from Wikipedia tables. Annotators edit freely written Wikipedia text to produce table descriptions. Annotators also mark relevant cells, but not their alignments with the Base Entailment. To link relevant cells with tokens in the Base Entailment, we apply partial matching techniques. If all relevant cells are successfully matched, we proceed to build new entailments. In either scenario, contradictions are generated using any relevant cells for which alignments can be found. Table 3 illustrates this with an example.

Table Question Answering to Table NLI

Given a table and a question, the Table Question Answering task is to generate a long-form (sentence) or short-form (one word/phrase) answer to the question. A long-form answer is a Base Entailment in itself. Table 1 depicts an example of recasting QA data.

Table 6 :
Results for human evaluation of our generated data. Note that the verification labels are considered matched only if the annotators reached a majority and it matches our generated label.

Table 7 :
Accuracies for the base and large TAPAS-TNLI models trained on TabFact and tested on recasted datasets.

Table 8 :
Zero-shot accuracies for models trained on recasted data and tested on the TabFact simple, complex and full dev sets. Table-BERT and LPA Ranking are supervised baselines taken from TabFact (Chen et al., 2020b).

Table 9 shows the performance after data augmentation. Our best model outperforms the Table-BERT and LPA Ranking baselines (Chen et al., 2020b) by 17 points, and Eisenschlos et al. (2020) by 1.1 points.

Table Pre-training

Eisenschlos et al. (2020) explore pre-training through several tasks such as Mask Column Prediction in TaBERT (Yin et al., 2020), Multi-choice Cloze at the Cell Level in TUTA (Wang et al., 2021c), Structure Grounding (Deng et al., 2021), and SQL execution (Liu et al., 2021). Our work is closely related to Eisenschlos et al. (2020), which uses two pre-training tasks over Synthetic and Counterfactual data to drastically improve accuracies on downstream tasks. Pre-training data is synthesized using templates (Eisenschlos et al., 2020).

Entailment H: Michael Phelps ranked better in 2012 than in 2016 for the 100m Butterfly event.

Table 10 :
An example table and an entailment derived from the same.