TAPAS at SemEval-2021 Task 9: Reasoning over tables with intermediate pre-training

We present the TAPAS contribution to the Shared Task on Statement Verification and Evidence Finding with Tables (SemEval 2021 Task 9, Wang et al. (2021)). SEM TAB FACT Task A is a classification task of recognizing if a statement is entailed, neutral or refuted by the content of a given table. We adapt the binary TAPAS model of Eisenschlos et al. (2020) to this task. We learn two binary classification models: a first model to predict if a statement is neutral or non-neutral, and a second one to predict if it is entailed or refuted. As the shared task training set contains only entailed or refuted examples, we generate artificial neutral examples to train the first model. Both models are pre-trained using a MASKLM objective, intermediate counter-factual and synthetic data (Eisenschlos et al., 2020) and TABFACT (Chen et al., 2020), a large table entailment dataset. We find that the artificial neutral examples are somewhat effective at training the first model, achieving 68.03 test F1 versus the 60.47 of a majority baseline. For the second stage, we find that the pre-training on the intermediate data and TABFACT improves the results over MASKLM pre-training (68.03 vs 57.01).


Introduction
Recently, the task of Textual Entailment (TE) (Dagan et al., 2005) or Natural Language Inference (NLI) (Bowman et al., 2015) has been adapted to a setup where the premise is a table (Chen et al., 2020). The Shared Task on Statement Verification and Evidence Finding with Tables (SemEval 2021 Task 9, Wang et al. (2021)) follows this line of work and provides a new dataset consisting of tables extracted from scientific articles and natural language statements written by crowd workers. In this paper, we discuss a system for tackling Task A, a multi-class classification task that requires finding if a statement is entailed, neutral or refuted by the contents of a table. The training set contains only entailed and refuted examples and requires data augmentation to learn the neutral class. Additionally, this dataset is composed of English-language data and requires sophisticated contextual and numerical reasoning, such as handling comparisons and aggregations.

[Figure 1: Overview of the training pipeline used in our system (Step 1: pre-train with MLM; Step 2: intermediate pre-training with counterfactual+synthetic data; Step 3: intermediate pre-training on TABFACT; Step 4a: fine-tune the neutral detector; Step 4b: fine-tune the binary entailment model). We use intermediate pre-training on counterfactual+synthetic data and then fine-tune on TABFACT (Chen et al., 2020).]
A successful line of research on table entailment (Chen et al., 2020) has been driven by BERT-based models. These approaches reason over tables without generating logical forms, directly predicting the entailment decision. Such models are known to be effective at representing textual data as well as reasoning over semi-structured data such as tables. In particular, TAPAS-based models (Herzig et al., 2020), which encode the table structure using additional embeddings, have been successfully used to solve binary entailment tasks with tables. To address multi-class entailment classification we decompose the main task into two sub-tasks and use two TAPAS models, as described in Figure 2. A first model classifies the statement into neutral or non-neutral, and a second into entailed or refuted. The two models are learned separately: we create artificial neutral statements to fine-tune the first model. Examples are extracted by randomly pairing statements and tables from the SEMTABFACT training set. We also generate harder examples by creating new tables from the original tables, removing columns that contain evidence to refute or entail the statement. This procedure is discussed in Section 3.
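The two-stage decomposition can be sketched as a simple cascade. The function and model names below are hypothetical stand-ins for the two fine-tuned TAPAS classifiers, not our actual implementation:

```python
def classify(statement, table, neutral_model, entailment_model, neutral_threshold=0.0):
    """Three-way decision from two binary models (hypothetical interfaces).

    Each model maps (statement, table) to a logit; a positive logit means the
    model's positive class (neutral for the first, entailed for the second).
    """
    # Stage 1: is the statement neutral with respect to the table?
    if neutral_model(statement, table) > neutral_threshold:
        return "neutral"
    # Stage 2: only non-neutral statements reach the entailment model.
    return "entailed" if entailment_model(statement, table) > 0.0 else "refuted"
```

Note that the two stages can be trained and calibrated independently, which is what allows the neutral detector to be learned from artificial data alone.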
We follow Eisenschlos et al. (2020) and pre-train the two TAPAS models with a MASKLM objective and then with counterfactual and synthetic data, as shown in Figure 1. We additionally fine-tune both models on the TABFACT dataset. Details are given in Section 4.
We find that our procedure for creating artificial neutral statements out-performs a majority baseline and that pre-training helps for both the first and the second stage. Our best models achieve 68.03 average micro f1-score on the test set.

Related Work
Entailment on Tables Recognizing textual entailment (Dagan et al., 2010) has expanded from a text-only task to incorporate more structured data, such as knowledge graphs (Vlachos and Riedel, 2015), tables (Jo et al., 2019) and images (Suhr et al., 2017, 2019). The TABFACT dataset (Chen et al., 2020), for example, uses tables as the premise, or source of information, to resolve whether a statement is entailed or refuted. The TAPAS architecture introduced by Herzig et al. (2020) can be used to obtain transformer-based baselines by using special embeddings to encode the table structure. Chen et al. (2020) also use BERT-like models but obtain less accurate results, possibly due to not using table-specific pre-training.
Intermediate Pre-training Our system relies on intermediate pre-training, a technique that appears in different forms in the literature. Language model fine-tuning (Howard and Ruder, 2018) and domain adaptive pre-training (Gururangan et al., 2020) are useful applications for domain adaptation. In a similar manner to Pruksachatkun et al. (2020), we use the Counterfactual+Synthetic tasks of Eisenschlos et al. (2020) to improve the discrete and numeric reasoning capabilities of the model for table entailment.
Synthetic data The use of synthetic data to improve learning in NLP is ubiquitous (Alberti et al., 2019; Lewis et al., 2019; Wu et al., 2016; Leonandya et al.; Kaushik et al., 2020; Gardner et al., 2020).

[Table 1: SEMTABFACT (Wang et al., 2021) statistics. The training data for the first stage was created from the crowdsourced training data using artificial neutral statements created by deleting columns with evidence or swapping statements randomly. For the second stage we use the crowdsourced training data.]
Our system solves each stage with a binary TAPAS classifier. TAPAS (Herzig et al., 2020) is a variation of BERT, extended with special token embeddings that give the model a notion of the row and column a token is located in and what its numeric rank is with respect to the other cells in the same column.

Pre-training
The original TAPAS model (Herzig et al., 2020) was pre-trained with a Mask-LM objective on tables extracted from Wikipedia. It was later found (Eisenschlos et al., 2020) that its reasoning capabilities can be improved by further training on artificial counter-factual entailment data. This led to substantial improvements on the TABFACT dataset (Chen et al., 2020), a binary table entailment task similar to SEMTABFACT. On that dataset, the test set accuracy for a BERT-base-sized model improved from 69.6 to 78.6. In this work, we use models fine-tuned on TABFACT as the foundation for both stages. We also experimented with using models fine-tuned on INFOTABS and SQA (Iyyer et al., 2017) as the initial models but did not find that to achieve better accuracy. The overall pre-training strategy is described in Figure 1, where we also show how we use these checkpoints to initialize the two classification models described below.

Neutral Identification Stage
As discussed, the first stage of the system identifies if a statement is neutral. Training a system for this task is challenging as the SEMTABFACT training data does not contain neutral statements. We therefore created artificial neutral statements from two sources. Following the recommendation of the shared-task organizers, we created neutral statements by randomly pairing statements from the training set with new tables. Additionally, we created neutral statements by identifying columns that contained evidence for deciding whether a statement is entailed and then randomly removing one of these columns. Our assumption is that it should not be possible to decide whether the statement is entailed when an evidence column has been removed. We do not remove the first column of a table since that often contains the name of the row entries. In order to detect the columns containing the evidence, we trained an ensemble of 5 TAPAS QA models on the automatically generated SEMTABFACT training set. Note that the auto-generated data is generated from templates and, in contrast to the crowdsourced training data, does have evidence cell annotations. The models are trained to predict the binary entailment decision as well as the evidence cells at the same time, and are initialized using a TAPAS model fine-tuned on SQA. We use the same hyper-parameters as SQA (as discussed in Herzig et al. (2020)). We then run these models over the crowdsourced training data and, for all examples where the majority of the models correctly predicts the entailment label, we extract all columns for which a majority of the ensemble predicted at least one evidence cell. Evaluation on the SEMTABFACT development set showed that the precision of this column selection process is 0.87 (87% of the extracted columns contain a reference cell).
For each column, we then create a new artificial neutral example by removing that column from the table.
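Both sources of artificial neutral examples can be sketched as follows. The dictionary schema and helper names are illustrative assumptions, not the shared-task data format:

```python
import random

def random_pairing_neutrals(examples, n, seed=0):
    """Create neutral examples by pairing a statement with an unrelated table."""
    rng = random.Random(seed)
    neutrals = []
    while len(neutrals) < n:
        a, b = rng.sample(examples, 2)
        # A statement written for one table is assumed neutral w.r.t. another.
        neutrals.append({"statement": a["statement"],
                         "table": b["table"],
                         "label": "neutral"})
    return neutrals

def column_removal_neutral(example, evidence_column):
    """Create a neutral example by dropping one predicted evidence column.

    The first column is never removed, as it usually names the row entries.
    """
    if evidence_column == 0:
        return None
    table = [[cell for j, cell in enumerate(row) if j != evidence_column]
             for row in example["table"]]
    return {"statement": example["statement"], "table": table, "label": "neutral"}
```

The column-removal variant yields harder training examples than random pairing, since the statement remains topically related to the table.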

Entailment Stage
Training the entailment stage is straightforward: we train the model on the crowdsourced training data using the same hyper-parameters as Eisenschlos et al. (2020).

Calibration and Ensemble
As our training data for stage 1 is balanced but the development data is skewed, we find that accuracy improves if we only trigger the neutral class for examples with a logit larger than 4.0 (rather than 0.0). Empirically, we also find the threshold of 4.0 to work better for the second stage. This could be explained by the fact that the development set has a different label distribution than the training set.
We train 5 models per stage and use them as an ensemble. The ensemble score is defined as the median of all the model scores. Using the median worked better than the mean and voting in preliminary experiments.
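A minimal sketch of the calibrated ensemble decision for stage 1 (the threshold value follows the text above; the function names are our own):

```python
import statistics

def ensemble_score(model_logits):
    """Median of the per-model logits; more robust than mean or voting here."""
    return statistics.median(model_logits)

def triggers_neutral(model_logits, threshold=4.0):
    """Stage-1 decision: predict neutral only above the calibrated threshold."""
    return ensemble_score(model_logits) > threshold
```

Using the median makes the ensemble insensitive to a single run with an extreme logit, which the mean is not.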

Experimental Setup
In this section we explain the SEMTABFACT task and dataset and give additional details about the experimental setup we used.
The SEMTABFACT dataset consists of statements and tables from the scientific literature. It is much smaller than similar datasets such as TABFACT (Chen et al., 2020) and INFOTABS. It is noteworthy that the training set only contains entailed and refuted statements while the dev and test sets also contain neutral (unknown) statements. The statements were written by crowd workers, who were presented with 7 different types of statements and instructed to write one statement of each type. The types of statements cover aggregation, superlatives, counting, comparatives, unique counting and the usage of the caption or common-sense knowledge.
The main metric of the task is the micro f1-score computed over the statements belonging to a table. The 3-way score takes all statements into account while the 2-way score is restricted to refuted and entailed statements.
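For single-label classification, micro F1 over a set of statements reduces to accuracy, since every wrong prediction is simultaneously a false positive and a false negative. Under that reading, the scoring could be sketched as below (our own helper names; the official scorer may differ in details):

```python
def table_micro_f1(gold, pred, labels):
    """Micro F1 over one table's statements, restricted to gold labels in `labels`.

    For single-label classification this equals accuracy on the kept statements.
    """
    kept = [(g, p) for g, p in zip(gold, pred) if g in labels]
    if not kept:
        return 0.0
    return sum(g == p for g, p in kept) / len(kept)

def task_score(tables, labels=("entailed", "refuted", "neutral")):
    """Average the per-table micro F1; pass two labels for the 2-way variant."""
    return sum(table_micro_f1(g, p, labels) for g, p in tables) / len(tables)
```

The 2-way score simply drops statements whose gold label is neutral before pooling.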

Results
Table 2 compares our system to multiple baselines. Unless stated otherwise, all baselines have been trained with the same neutral data generation as discussed above and for 20,000 steps. All numbers are based on 5 independent model runs. For all setups we report the median of the individual runs as well as the results for a system based on the median logit of the 5 models. We report error margins for the medians as half the inter-quartile range.
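The reporting scheme for the 5 runs can be sketched as below; using `statistics.quantiles` with its default exclusive method is our assumption about the exact quartile definition:

```python
import statistics

def median_with_margin(run_scores):
    """Median of independent runs, with half the inter-quartile range as margin."""
    median = statistics.median(run_scores)
    q1, _, q3 = statistics.quantiles(run_scores, n=4)  # exclusive quartiles
    return median, (q3 - q1) / 2
```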
Looking at the first stage of the system in Table 2, we see that the system based on TABFACT is the best choice for the initialization, out-performing a simple BERT model as well as models trained with only the MASKLM and intermediate pre-training, on both dev and test ensemble accuracy. However, the model trained on the intermediate data gives higher median dev and test accuracy (e.g. 72.12 vs 70.76).
With respect to the data generation we observe that any kind of neutral data generation outperforms the majority baseline. Combining the column removal and random statements yields the best results. The drop in the 2-way metrics going from the majority Stage 1 model to a learned model is expected as that metric ignores all neutral statements in the eval set.
On the second stage of our system (Table 2), we see that a TAPAS model based on TABFACT outperforms the other baselines by a bigger margin than for Stage 1. For example, a model based on only MASKLM pre-training achieves 57.01 test f1-score while the TABFACT-based model achieves 68.03. We also found that for this stage there is a more pronounced difference between BERT and MASKLM (54.15 vs 57.01) and between MASKLM and intermediate pre-training (57.01 vs 66.43). Table 5 in the appendix shows the results for different numbers of steps and thresholds, showing that results can be slightly improved by tuning them. Table 3 shows that the recall and precision on the neutral class are 37.6 and 71.4, respectively. Inspecting some instances of false positives, we find that the system is quite easily fooled; for example, it classifies the statement "The lowest Factor 8 is 0.027" as non-neutral for a table that has 5 columns labeled Factor 1 to 5. False negatives are sometimes caused by failing to map words with typos ("paramters" vs "parameters") or abbreviations ("measurement errors" vs "ME"). Adding harder examples of neutral statements to the training set could potentially further improve the identification. We also see that the recall on the refuted class (74.3) is lower than the recall of the entailed class (85.2) while their precision values are similar.

Analysis
In Table 4 we construct mutually exclusive groups of the validation set. Each set is identified by specific keywords appearing in the statement; for example, Comparatives must contain "higher", "better", "than", etc. The full list is defined in the appendix of Eisenschlos et al. (2020). We observe that comparatives and aggregations have the largest total error rates, meaning that the biggest gains in overall accuracy can be made by improving those reasoning skills. Between these two, Comparatives have the lowest in-group accuracy. Table 6 and Table 7 in the appendix show the same analysis for Stage 1 and Stage 2, respectively. The trend for Stage 2 is similar to the overall trend, whereas Stage 1 accuracy is relatively stable across the different groups, except for comparatives, where the accuracy drops from 87% overall to 81%.

[Table 4: Accuracy and total error rate (ER) for different question groups derived from the keyword heuristics defined in Eisenschlos et al. (2020). The baseline is the simple class majority and the error rate in each group is taken with respect to the full set. Comparatives show the biggest margin for future improvement compared with the overall system accuracy.]
Another class of examples with relatively low accuracy are statements around unique counting. We find that statements containing the word "different" have an accuracy of 51.3% (vs. 71% overall) and account for 3.4 percentage points of the total error rate. Examples include "There are six different classes" and "They have ten different parameters".

Conclusion
We presented our contribution to the SEMTABFACT task (Wang et al., 2021) on table entailment. Our system consists of two stages that classify statements into non-neutral or neutral and refuted or entailed. Our model achieves 68.03 average micro f1-score on the test set. We showed that our procedure for creating artificial neutral statements improves the system over a majority baseline but results in a relatively low recall of 37.6. Other methods for creating harder neutral statements might further improve this value. In line with Eisenschlos et al. (2020), we find that pre-training on intermediate data improves the system accuracy over a system purely pre-trained with a MASKLM objective. While these initial results look promising, we find that the model struggles with statements that involve complex operations such as comparisons and unique counting.