ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples

Reasoning over tabular data requires both table structure understanding and a broad set of table reasoning skills. Current models with table-specific architectures and pre-training methods perform well on understanding table structures, but they still struggle with tasks that require various table reasoning skills. In this work, we develop ReasTAP to show that high-level table reasoning skills can be injected into models during pre-training without a complex table-specific architecture design. We define 7 table reasoning skills, such as numerical operation, temporal comparison, and conjunction. Each reasoning skill is associated with one example generator, which synthesizes questions over semi-structured tables according to sampled templates. We model the table pre-training task as a sequence generation task and pre-train ReasTAP to generate precise answers to the synthetic examples. ReasTAP is evaluated on four benchmarks covering three downstream tasks, including 1) WikiSQL-Weak and WikiTQ for Table Question Answering, 2) TabFact for Table Fact Verification, and 3) LogicNLG for Faithful Table-to-Text Generation. Experimental results demonstrate that ReasTAP achieves new state-of-the-art results on all of them and delivers a significant improvement in low-resource settings. Our code is publicly available at https://github.com/Yale-LILY/ReasTAP.


Introduction
Inspired by the massive success of pre-trained language models (LMs) on free-form natural language (NL) tasks (Devlin et al., 2019; Dong et al., 2019; Raffel et al., 2020; Lewis et al., 2020), researchers have attempted to extend pre-training to tabular data. Tables are a valuable form of data that organize information in a structured way, often in a more accessible manner than unstructured text. To adapt the pre-training paradigm to structured tabular data, previous works mainly focus on designing models with table-specific architectures and pre-training methods. This includes introducing a structure-aware attention mechanism (Yin et al., 2020; Deng et al., 2020; Zayats et al., 2021), adding auxiliary structure-indicative embeddings (Herzig et al., 2020; Eisenschlos et al., 2020; Wang et al., 2021b), and designing table-specific pre-training objectives (Yin et al., 2020; Yu et al., 2021a; Wang et al., 2021b; Liu et al., 2022b,a). While these methods are effective in understanding table structures, they increase modeling complexity and lack interpretability regarding why models learn table reasoning skills during pre-training.
This paper presents a new table pre-training approach, named REASTAP, which enables a model to efficiently learn table structure understanding and table reasoning skills during pre-training. We first defined 7 table reasoning skills, such as numerical operation and temporal comparison. As shown in Figure 1, for each reasoning skill, a corresponding example generator was applied to synthesize Question Answering (QA) examples over tables. We modeled the pre-training task as a sequence generation task and pre-trained a sequence-to-sequence (seq2seq) LM to generate the answers to the synthetic questions. REASTAP is theoretically applicable to any seq2seq LM without a table-specific architecture design. Our key insight is that if a language model can be pre-trained to generate the answers to synthetic questions that require various table reasoning skills, it should acquire strong table structure understanding and table reasoning capacity, thereby conferring benefits to downstream tasks. The main contributions of our work can be summarized as follows:

Example Generation
We defined 7 types of table reasoning skills, with examples and explanations shown in Table 1. The example generation pipeline was adapted from Yoran et al. (2021). Each reasoning skill is associated with one example generator and several question templates. The example generator was implemented as a function that takes a table T and generates several reasoning examples (T, q, a) according to the template, where q denotes the question and a denotes the answer. Each template contains typed variables that are instantiated with content from the extracted table. Specifically, column col and cell value val are indexed to specify that val:i must be instantiated by a cell value from the i-th column. Some templates also require that the selected column and cell value be of date or number type. OPERATOR and ORDINAL correspond to operators and ordinal numerals that are instantiated according to the specific reasoning skill. CONDITION:i can be 1) a cell value from the i-th column, or 2) a number/temporal comparison statement if the i-th column is of date or number type. For example, the question from Figure 1, "Which Franchise, with Owner(s) was Nintendo, has the 5th Total revenue ($ US Billion)?", was instantiated from the template "Which col:1, with CONDITION:2, has the ORDINAL col:3?" Once all variables in the sampled template were instantiated, we obtained the question q. Then the example generator would programmatically return the corresponding answer a.
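To make the generator mechanics concrete, here is a minimal sketch of an ordinal-comparison-style example generator in the spirit described above. The toy table, template wording, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of one example generator: instantiates a (hypothetical) template
# "Which col:1, with CONDITION:2, has the ORDINAL col:3?" over a toy table
# and programmatically computes the gold answer.

def ordinal_suffix(n):
    # "1st", "2nd", "3rd", "4th", ... (simplified; ignores 11th-13th)
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def generate_ordinal_example(table, col1, col2, cond_val, col3, ordinal):
    """Return a (question, answer) pair for one instantiated template."""
    question = (f"Which {col1}, with {col2} was {cond_val}, "
                f"has the {ordinal}{ordinal_suffix(ordinal)} {col3}?")
    # Filter rows satisfying CONDITION:2, then rank them by col:3.
    rows = [i for i, v in enumerate(table[col2]) if v == cond_val]
    ranked = sorted(rows, key=lambda i: table[col3][i], reverse=True)
    answer = table[col1][ranked[ordinal - 1]]
    return question, answer

# Toy table: column name -> row-aligned cell values.
toy_table = {
    "Franchise": ["Pokemon", "Mario", "Call of Duty"],
    "Owner(s)": ["Nintendo", "Nintendo", "Activision"],
    "Total revenue": [90.0, 30.0, 27.0],
}
```

Calling `generate_ordinal_example(toy_table, "Franchise", "Owner(s)", "Nintendo", "Total revenue", 2)` yields a question about the 2nd-highest revenue among Nintendo-owned franchises, with the answer computed directly from the table rather than annotated by hand.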

Example Sampling
After generating a vast number of QA examples for each reasoning skill, we sampled pre-training data from these synthetic examples. In our setting, the portion of pre-training examples (Table 1) corresponding to each reasoning skill roughly matches the portion of logical operations defined in TABFACT (Chen et al., 2020b). We raised the portion of the numerical operation skill, as numerical reasoning is more challenging for models to learn. To increase the diversity of the pre-training corpus, for each reasoning skill we also sampled {SQL query, execution result} pairs from TAPEX (Liu et al., 2022b).

Pre-training REASTAP

Task Formulation Each example in the synthetic pre-training corpus contains a question q and a semi-structured table T as the model input. The task objective is to generate an accurate answer string a = (a_1, a_2, ..., a_n) given the question q and input table T:

a = argmax ∏_{i=1}^{n} P(a_i | a_{<i}, q, T; θ),

where θ denotes the parameters of a seq2seq LM.
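The skill-proportioned sampling of synthetic examples described in this section can be sketched as follows; the skill names, portions, and pool layout are illustrative assumptions, not the actual distribution used in the paper.

```python
import random

def sample_corpus(pools, portions, total, seed=0):
    """Draw `total` examples so each skill's share of the corpus
    roughly matches its target portion."""
    rng = random.Random(seed)
    corpus = []
    for skill, frac in portions.items():
        k = int(total * frac)
        corpus.extend(rng.sample(pools[skill], k))
    rng.shuffle(corpus)
    return corpus

# Hypothetical pools of synthetic examples per skill, tagged by skill name.
pools = {s: [(s, i) for i in range(100)]
         for s in ["numerical", "temporal", "conjunction", "counting"]}
# Hypothetical portions: numerical reasoning is deliberately up-weighted.
portions = {"numerical": 0.4, "temporal": 0.2,
            "conjunction": 0.2, "counting": 0.2}
```

Up-weighting one entry in `portions` is all it takes to emphasize a harder skill, mirroring the raised portion of numerical operation examples described above.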
In our experiments, we implemented REASTAP based on BART (Lewis et al., 2020), a widely used Transformer-based pre-trained model (Vaswani et al., 2017) that has proved its effectiveness on various comprehension and text generation tasks.
In our experiments, we chose BART-Large as a backbone, which has around 400M parameters and 12 layers in both encoder and decoder.
Data Serialization As illustrated in Figure 1, the input contains a question and its corresponding table. We flattened the table so that it can be fed directly into the encoder-decoder model. Specifically, we insert several special tokens to indicate the table boundaries; a flattened table is denoted as

T* = [HEAD] c_1 | c_2 | ... | c_N [ROW] 1 : r_1 [ROW] 2 : r_2 ... [ROW] M : r_M,

where [HEAD] and [ROW] are special tokens indicating the region of table headers and rows, respectively. We prefixed the flattened table T* with the question and fed them into the model encoder.
The decoder is tasked to generate the answer(s), separated by commas, autoregressively.
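A minimal sketch of this serialization, assuming the [HEAD]/[ROW] layout with '|'-separated cells and numbered rows (the exact delimiter layout here is our assumption):

```python
def flatten_table(header, rows):
    """Linearize a table with [HEAD]/[ROW] boundary tokens so a seq2seq
    encoder can consume it as plain text."""
    flat = "[HEAD] " + " | ".join(header)
    for i, row in enumerate(rows, start=1):
        flat += f" [ROW] {i} : " + " | ".join(str(c) for c in row)
    return flat

def build_model_input(question, header, rows):
    # The question is prefixed to the flattened table.
    return question + " " + flatten_table(header, rows)

def format_answers(answers):
    # The decoder target: answer(s) joined by commas.
    return ", ".join(answers)
```

Because the table becomes one token sequence, no structure-specific encoder is needed; row identity is carried entirely by the inserted boundary tokens and row indices.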

Downstream Tasks
We evaluated REASTAP on three different types of downstream tasks to verify its effectiveness. The statistics and examples for each task are shown in Table 2 and Table 10 in the Appendix, respectively. The fine-tuning of REASTAP is similar to the pre-training procedure described above. We modeled all downstream tasks as sequence generation tasks and leveraged generative LMs to generate the output autoregressively.
We used accuracy (i.e., the percentage of correct predictions) as the evaluation metric.
It is worth noting that higher BLEU scores do not correlate with better logical fidelity (Nan et al., 2022a).
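As a rough illustration of the accuracy metric in the generation setting, a prediction can be scored correct when its comma-separated answer set matches the gold set. The normalization below is a simplified assumption; the official evaluation scripts apply more elaborate normalization.

```python
def parse_answers(answer_string):
    """Split a comma-separated answer string into a normalized, sorted list."""
    return sorted(a.strip().lower() for a in answer_string.split(","))

def denotation_accuracy(predictions, golds):
    """Fraction of examples whose predicted answer set matches the gold set."""
    correct = sum(1 for p, g in zip(predictions, golds)
                  if parse_answers(p) == parse_answers(g))
    return correct / len(golds)
```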

Implementation Details
We implemented our models based on the fairseq library (Ott et al., 2019). We adopted BART-Large as the backbone model. For table pre-training, we synthesized and sampled 4M pairs of reasoning examples. In the following sections, unless specified explicitly, all experimental results were evaluated under the default settings of 4M reasoning examples and the BART-Large configuration. Our pre-training procedure ran 80,000 steps with a batch size of 256, which took about 34 hours on a cluster of 8 NVIDIA A5000 24GB GPUs. For downstream tasks, the fine-tuning procedure ran 30,000 steps with a batch size of 256. The best pre-training and fine-tuning checkpoints were both selected according to the validation loss.

Main Results
Table QA On WIKISQL-WEAK, REASTAP outperforms all the baselines (Table 3). Specifically, on the test set of WIKISQL-WEAK, REASTAP achieves a denotation accuracy of 90.4%, which is 4.3% higher than BART and 1.2% higher than the previous best performance. On the more challenging WIKITQ, as shown in Table 4, REASTAP also surpasses the previous best system by 1.4%. It is worth noting that compared to WIKISQL-WEAK, WIKITQ contains far fewer tables and examples, which makes the adaptation of BART to tabular structures more challenging. Furthermore, REASTAP obtains an improvement of 21.1% over BART, indicating that in the low data regime, the improvements brought by REASTAP are more significant. We also evaluated REASTAP performance under low-resource settings (Section 6.1).

Table-to-Text Generation Although REASTAP's pre-training corpus is irrelevant to the text generation task, REASTAP significantly improves the logical-fidelity scores, enhancing SP-Acc and NLI-Acc by 4.7% and 5.5%, respectively. The results demonstrate that REASTAP can also help improve faithful text generation.

Analysis
Experimental results on three different kinds of downstream tasks show that REASTAP can broadly improve BART's generic table reasoning capabilities, which can be adapted to different downstream tasks regardless of whether a task is highly similar to the REASTAP pre-training task. In this section, we further analyze our approach from various aspects to provide researchers with deeper insight for future work.

Low-resource Setting
To further understand how well REASTAP learns table reasoning skills during pre-training, we conducted experiments under the low-resource setting, where we fine-tuned REASTAP on 20% and 5% of the downstream task training data. As shown in Table 7, in the low-resource setting, the improvements introduced by REASTAP are often more significant. For example, with only 5% of the training data of downstream tasks, REASTAP delivers a dramatic improvement of 20.0%, 21.4%, and 11.2% over BART on WIKISQL-WEAK, WIKITQ, and TABFACT, respectively. The results from the low-resource setting show that REASTAP endows BART with generic table reasoning capabilities.

The Scale of Pre-training Corpus
Figure 2 illustrates REASTAP performance on downstream tasks with different pre-training corpus scales. We found that increasing the pre-training corpus generally brings positive effects for all downstream tasks. More specifically, for simple tasks like WIKISQL-WEAK, the gains from scaling up the pre-training corpus are marginal, while for complex tasks like WIKITQ, performance shows a consistently positive trend as the pre-training corpus scales up.

Necessity of Each Reasoning Skill
We investigated the contributions of the 7 reasoning skills to the downstream task performance of REASTAP. We devised 8 variants of REASTAP: one was trained with examples from all reasoning skills, while the others were each trained with examples of one reasoning skill removed. For each reasoning skill, we sampled 150K examples from the pre-training corpus, keeping the scale of the pre-training corpus the same (i.e., 900K). We chose WIKITQ for these experiments, on which BART does not perform well. Results shown in Table 8 demonstrate that all reasoning skills benefit model performance on WIKITQ. Furthermore, we find that some reasoning skills, such as counting and temporal comparison, bring larger improvements than others.
The analysis also helps us understand how the set of pre-defined reasoning skills is injected during pre-training. When adapting REASTAP to a new downstream task that requires reasoning skills different from the existing seven, one can inject the new reasoning skill into the model during pre-training in a similar way. Specifically, once the templates for the new reasoning skill are designed, the synthesis pipeline can generate new examples for pre-training. Pre-training REASTAP on these synthetic examples helps the model learn the new reasoning skill.
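One way to organize such extensibility is a skill registry where each reasoning skill maps to its generator function; adding a skill then amounts to registering one more generator. The registry pattern and the toy counting generator below are our own illustration, not the paper's code.

```python
# Registry mapping a reasoning-skill name to its example generator.
GENERATORS = {}

def register_skill(name):
    """Decorator that registers a generator for a (new) reasoning skill."""
    def wrap(fn):
        GENERATORS[name] = fn
        return fn
    return wrap

@register_skill("counting")
def counting_generator(table, col, val):
    """Toy generator: counts rows whose `col` cell equals `val`."""
    question = f"How many rows have {col} equal to {val}?"
    answer = str(sum(1 for v in table[col] if v == val))
    return question, answer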

Multi-Task Fine-tuning
We further conducted multi-task fine-tuning experiments to explore whether REASTAP can benefit from a source task. We chose WIKISQL-WEAK and TABFACT as the source tasks, as their training datasets are relatively rich, and WIKITQ as the target task. Models were first fine-tuned on the source task and then fine-tuned on the target task. As shown in Table 9, multi-task fine-tuning delivers a significant improvement on the target task when the model is initialized from BART, while the improvements are marginal when initialized from REASTAP. This is reasonable because most table reasoning skills acquired via multi-task learning have already been injected into the model during pre-training.

Related Work
Reasoning Over Tables Reasoning over the input context is an important requirement for neural models applied in the real world, especially when the input is structured knowledge such as a table. Several Table QA benchmarks (Pasupat and Liang, 2015; Zhong et al., 2017; Iyyer et al., 2017; Chen et al., 2020c) have been proposed to test systems' capability to conduct different types of reasoning, including numerical, logical, or multi-hop reasoning. For Table Fact Verification tasks (Chen et al., 2020b; Aly et al., 2021), models are required to perform logical inference to verify whether a given statement is entailed or refuted. Furthermore, Table-to-Text tasks (Chen et al., 2020a; Parikh et al., 2020; Nan et al., 2022b) require generating a natural language description of some part of a table based on inferences obtained from facts in the context. More recently, numerical reasoning over tabular data in the financial domain has also attracted increasing attention (Zhu et al., 2021; Chen et al., 2021b; Zhao et al., 2022; Cheng et al., 2022; Li et al., 2022; Zhou et al., 2022).

Conclusion
In this paper, we propose REASTAP, a new table pre-training approach that injects various pre-defined table reasoning skills into models by learning to generate correct answers to synthetic questions. Compared to previous work that designs table-specific architectures, REASTAP is easy to implement and is theoretically applicable to any sequence-to-sequence LM. REASTAP is evaluated over four downstream benchmarks. The experimental results demonstrate that REASTAP achieves new state-of-the-art results on each of them, improving WIKISQL-WEAK denotation accuracy to 90.4% (+1.2%), WIKITQ denotation accuracy to 58.6% (+1.4%), TABFACT accuracy to 84.7% (+0.7%), and LOGICNLG SP-Acc to 54.8% (+4.0%) and NLI-Acc to 89.2% (+3.6%). Further analysis demonstrates that REASTAP delivers a significant improvement over BART in low-resource settings, indicating that our proposed pre-training approach can effectively improve the model's generic table reasoning capabilities.

Limitations
The main limitation of our approach is that we utilized a template-based method to synthesize the pre-training corpus. Although such a template-based approach ensures the faithfulness of generated QA examples and the diversity of reasoning processes required to answer the questions, it limits the semantic diversity of the questions. We believe future work could exploit 1) more types of reasoning skills, such as the advanced numerical reasoning skills required in the finance domain (Zhu et al., 2021; Chen et al., 2021b); 2) a more universal synthetic example generation pipeline; 3) extending models to tables with hierarchical structures (e.g., more than one row or column header) (Cheng et al., 2022; Zhao et al., 2022); and 4) a more efficient training framework (Biesialska et al., 2020; Yoran et al., 2021) that can update models to learn newly-defined reasoning skills effectively.

Ethical Consideration
Tables used in our synthetic pre-training corpus are collected and extracted from the 02-20-2022 Wikipedia dump, which is publicly available under the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License. The licenses permit us to compose, modify, publish, and distribute additional annotations upon the original content.

Figure 1: The illustration of REASTAP pre-training. The tables are crawled from Wikipedia. During preprocessing, we perturb the table row order to alleviate unwanted bias introduced by table encoding. The colored cells are relevant facts necessary to answer the given question. Each color corresponds to a different table reasoning skill, and each reasoning skill corresponds to an example generator, which synthesizes QA pairs over tables according to the sampled templates. We model the pre-training task as a sequence generation task and pre-train REASTAP to generate correct answers given the flattened table and synthetic question.
The sampled {SQL query, execution result} pairs served as complementary QA examples in the pre-training corpus and were categorized according to their function (e.g., COUNT, SUM). As a result, we obtained a total of 4M pairs of reasoning examples as the pre-training corpus for REASTAP.

Figure 2: REASTAP downstream task performance on the dev set with different scales of pre-training corpus.
Q: What was the Television Service when the Country was Italy and the Content was Sport? A: Sky OMC Sports, ESPN, Gazzetta TV, ...
Q: Which Franchise, with Owner(s) was Nintendo, has the 5th Total revenue ($ US Billion)? A: Pokemon

Table 1: 7 reasoning skills with examples for pre-training REASTAP. Variable names indicate permissible instantiations. col denotes a column name, val denotes a cell value, and indices denote that a cell value must originate from the specified column. OPERATOR and ORDINAL correspond to operators and ordinal numerals that are instantiated according to the specific reasoning skill, e.g., for 'Temporal Comparison', ORDINAL is replaced with a reasonable ordinal numeral such as "4th". CONDITION:i can be 1) a cell value from the i-th column, or 2) a number/temporal comparison statement (e.g., "later than 1967") if the i-th column is of number or date type.

Table Fact Verification
We chose TABFACT (Chen et al., 2020b) to evaluate REASTAP performance on Table Fact Verification tasks. Given a table and a statement, TABFACT requires distinguishing whether the statement is entailed or refuted by the table. TABFACT divides its test set into Test_simple and Test_complex subsets, where Test_complex contains examples requiring more complex table reasoning skills. Furthermore, it selects a small test set, Test_small, with 2K samples for human evaluation. To fine-tune on TABFACT, following BART (Lewis et al., 2020), we applied a binary classifier upon the hidden state of the last token in the decoder for the output. The objective is to predict the verification label L ∈ {0, 1} given the statement s = (s_1, s_2, ..., s_n) and the input table T:

L = argmax_{L ∈ {0,1}} P(L | s, T; θ).

Table 2: Overview of downstream tasks used in this paper.
Faithful Table-to-Text Generation We chose LOGICNLG (Chen et al., 2020a) to evaluate REASTAP performance on the Faithful Table-to-Text Generation task. Compared with previous Table-to-Text generation benchmarks (Wiseman et al., 2017; Balakrishnan et al., 2019; Parikh et al., 2020; Nan et al., 2022b), which primarily focus on surface-level realizations without much logical inference, LOGICNLG is tasked to generate statements that are logically entailed by the selected table region. Given the serialized input table with its selected columns as T, the objective is to generate a sentence y = (y_1, y_2, ..., y_n) that is both fluent and factually correct:

y = argmax ∏_{i=1}^{n} P(y_i | y_{<i}, T; θ).

Following Chen et al. (2020a), we used SP-Acc and NLI-Acc as logical-fidelity evaluation metrics, and BLEU-1/2/3 as surface-level evaluation metrics. NLI-Acc uses a Natural Language Inference (NLI) model to predict entailment relationships.

Table Fact Verification
As shown in Table 5, REASTAP also obtains a new state-of-the-art accuracy on all test subsets of TABFACT. For example, it surpasses the previous best system by 0.4% on Test_simple and 1.0% on Test_complex.

Table 6: Performance on the LOGICNLG test set.

Table 7: Performance on the dev set under the low-resource setting. Results show the average over 3 random seeds.

Table 8: WIKITQ dev set denotation accuracy with examples of each reasoning skill removed from the full model pre-training corpus. Each variant is trained using 900K pre-training examples.

Table 9: Dev denotation accuracy of multi-task fine-tuning on the target task WIKITQ.
For example, Geva et al. (2020) utilized automatically-generated numerical data to inject numerical reasoning skills during pre-training. And Yoran et al. (2021) leveraged large-scale Wikipedia resources to automatically generate examples that require reasoning over multiple facts in a paragraph, and continued pre-training LMs on this synthetic corpus. Furthermore, recent works such as TAPEX (Liu et al., 2022b) showed that table reasoning skills can be learned from synthetic SQL programs. And Jiang et al. (2022) further pre-trained TAPEX over natural and synthetic QA examples to improve few-shot performance on table QA tasks. Meanwhile, pre-training for Text-to-SQL tasks has also attracted researchers' attention in recent years. Unlike previous work, we model the pre-training task as a sequence generation task, and inject various table reasoning skills into the model by tasking it to generate the precise answers of reasoning examples.