Volta at SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables using TAPAS and Transfer Learning

Tables are widely used in various kinds of documents to present information concisely. Understanding tables is a challenging problem that requires an understanding of language and table structure, along with numerical and logical reasoning. In this paper, we present our systems to solve Task 9 of SemEval-2021: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACTS). The task consists of two subtasks: (A) given a table and a statement, predicting whether the table supports the statement, and (B) predicting which cells in the table provide evidence for/against the statement. We fine-tune TAPAS (a model which extends BERT's architecture to capture tabular structure) for both subtasks as it has shown state-of-the-art performance in various table understanding tasks. In subtask A, we evaluate how transfer learning and standardizing tables to have a single header row improve TAPAS' performance. In subtask B, we evaluate how different fine-tuning strategies can improve TAPAS' performance. Our systems achieve an F1 score of 67.34 in subtask A three-way classification, 72.89 in subtask A two-way classification, and 62.95 in subtask B.


Introduction
There has been extensive work on verifying if a given textual context supports a given statement. Even though tables are also widely used to convey information, especially in scientific texts, there has been comparatively less work on verifying if a given table supports a statement. To this end, SemEval-2021 Task 9 (Wang et al., 2021) focuses on statement verification and evidence finding for tables from scientific articles in the English language. The task is divided into two subtasks, A and B. The aim of subtask A is to classify whether a given statement is entailed or refuted according to the given table and associated table metadata (such as captions and legends), or whether the statement's truth is unknown because it cannot be determined from the table. The aim of subtask B is to classify each cell in the table as relevant or irrelevant in determining whether the statement is entailed or refuted from the tabular evidence (the truth value of the statement is also provided).
Our systems use TAPAS (Herzig et al., 2020) trained with intermediate pre-training (Eisenschlos et al., 2020) for both subtasks. For subtask A, we fine-tune TAPAS after adding a three-way classification head on top for classifying the statement as entailed/refuted/unknown. We also evaluate how transfer learning and standardizing tables to have a single header row can improve TAPAS' performance. Due to the similarity between subtask B and table question-answering (which involves cell selection, or cell selection followed by aggregation), we use the TAPAS architecture previously used for table question-answering and fine-tune it to select the relevant cells. We also evaluate how different fine-tuning strategies can improve TAPAS' performance on evidence finding.
Our systems achieve an F1-micro score of 67.34 in subtask A and 72.89 in subtask A if the unknown statements are not considered while calculating the metrics (however, classifying entailed/refuted statements as unknown is still penalized). Our submitted system achieves an F1 score of 62.95 in subtask B. During the post-evaluation phase, we modified our system and achieved an F1-score of 65.48 in subtask B.
The code for our systems is available at https://github.com/devanshg27/sem-tab-fact.

Background
Verifying if the given textual evidence supports a given statement is a fundamental natural language processing problem. It has been extensively studied under different tasks such as RTE (Recognizing Textual Entailment) (Dagan et al., 2006), NLI (Natural Language Inference) (Bowman et al., 2015), and FEVER (Fact Extraction and VERification) (Thorne et al., 2018). In recent years, large-scale pre-trained models (Devlin et al., 2019; Peters et al., 2018; Liu et al., 2019) have dominated these tasks and have achieved close-to-human performance. NLVR (Suhr et al., 2017) and NLVR2 (Suhr et al., 2019) focus on verifying a statement given an image as evidence. TABFACT (Chen et al., 2020) focuses on verifying a statement given a table from Wikipedia as evidence.
Along with releasing TABFACT, Chen et al. (2020) also discuss two promising approaches for tabular fact verification: the Latent Program Algorithm (LPA) and Table-BERT. LPA is a semantic parsing approach that parses statements into programs (logical forms) and executes the programs against the table to predict the entailment decision. Most of the current models for TABFACT (Zhong et al., 2020; Shi et al., 2020) are semantic parsing approaches similar to LPA. Table-BERT encodes the linearized tables and statements using BERT-based models and directly predicts the entailment decision. Another line of work injects table structural information into the mask of the self-attention layer of BERT-based models, which helps the model learn better table representations. TAPAS (Herzig et al., 2020) extends BERT's architecture to capture the tabular structure, and it showed competitive performance on various table question-answering datasets: SQA (Iyyer et al., 2017), WTQ (Pasupat and Liang, 2015), and WikiSQL (Zhong et al., 2017). Eisenschlos et al. (2020) add an intermediate pre-training step to TAPAS before the fine-tuning step and show that it achieves state-of-the-art results on TABFACT and SQA (Iyyer et al., 2017). Their model is still 8 points behind human performance on TABFACT since tabular fact verification involves table understanding and complex reasoning.
While TABFACT also focuses on fact verification using tables as evidence, it focuses on tables from Wikipedia, whereas SemEval-2021 Task 9 (SEM-TAB-FACTS) instead focuses on tables from scientific articles and has a subtask related to evidence finding. Also, TABFACT did not have a neutral/unknown class, which its creators left out because of low inter-worker agreement due to confusion with the refuted class. Figure 1 shows an example of a table from the SEM-TAB-FACTS dataset and the labels for the two subtasks.

System Overview
In this section, we provide a general overview of our systems for the two subtasks. We use TAPAS for both subtasks.

Subtask A: Statement Verification
Pre-processing Since TAPAS only works on tables with single cells (cells which do not span multiple columns/rows), we first convert tables with multi-row/multi-column cells to tables with only single cells by duplicating the multi-row/multi-column cell's value into every single cell it spans. An example of the pre-processing is shown in Figure 2a.
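Below is a minimal sketch of this span expansion, assuming each cell is given with its position, row/column spans, and text; the field names are illustrative, not the dataset's actual schema:

```python
from typing import Dict, List


def expand_spanned_cells(cells: List[Dict], n_rows: int, n_cols: int) -> List[List[str]]:
    """Duplicate each multi-row/multi-column cell's value into every grid
    position it spans, producing a table of single cells only.

    `cells` is assumed to hold dicts like
    {"row": 0, "col": 1, "rowspan": 2, "colspan": 1, "text": "..."}.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell["row"], cell["row"] + cell.get("rowspan", 1)):
            for c in range(cell["col"], cell["col"] + cell.get("colspan", 1)):
                grid[r][c] = cell["text"]
    return grid
```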

Header Standardization
We experiment with standardizing the pre-processed tables with multi-row headers to tables with a single header row, since TAPAS was pre-trained on single-header tables and TABFACT (which we want to use for transfer learning) also contains single-header tables. We first predict the number of header rows using the following rules:

1. In many pre-processed tables, we found that the left-most column contained row names, and either (a) all the header cells in the left-most column were empty, or (b) the cell value at the top-left corner was repeated in all the header cells below it, or (c) the cell at the top-left corner was not empty, but the header cells below it were empty. Based on these cases, we initially estimate the number of header rows as the number of rows at the top such that all cells in the left-most column in those rows are either empty or have the same value as the cell at the top-left corner.

2. We also found that in many cases there were multi-column cells in the header, which had more specific sub-headers in the rows below. To handle these cases, we increment the estimate of header rows until no two adjacent columns have the same header cell values.
We merge the predicted header rows into a single row by joining each column's header cell values into a single cell with a newline as a separator. An example of header standardization is shown in Figure 2b. We were provided with HTML versions of the tables in the training and development set. We compare our predictions against the <thead> tags in the HTML tables to analyze our header prediction system's performance. The results are shown in Table 1. We also find that in almost all of the cases, the predictions are either correct or have an error of ±1.
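A compact sketch of this procedure is shown below; it assumes the table is a list of rows of strings after the span expansion described earlier, and the function names are illustrative:

```python
def estimate_header_rows(table):
    """Heuristically estimate the number of header rows (rules 1 and 2)."""
    # Rule 1: count the top rows whose left-most cell is empty or repeats
    # the value in the top-left corner.
    top_left = table[0][0]
    n_header = 0
    for row in table:
        if row[0] == "" or row[0] == top_left:
            n_header += 1
        else:
            break
    n_header = max(n_header, 1)

    # Rule 2: grow the header while two adjacent columns still share
    # identical header values (a sign of unexpanded multi-column headers).
    def adjacent_columns_collide(k):
        cols = list(zip(*table[:k]))
        return any(cols[j] == cols[j + 1] for j in range(len(cols) - 1))

    while n_header < len(table) - 1 and adjacent_columns_collide(n_header):
        n_header += 1
    return n_header


def standardize_header(table):
    """Merge the predicted header rows into a single row, joining each
    column's header values with a newline separator."""
    k = estimate_header_rows(table)
    merged = ["\n".join(col) for col in zip(*table[:k])]
    return [merged] + table[k:]
```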
To study the effect of header standardization, we will train all our systems with and without header standardization.
Model Our model takes the following input: [CLS] <statement> [SEP] <flattened table>, which is tokenized using the standard BERT tokenizer. We compute the class probabilities using a linear layer with a softmax activation function on top of the output of the [CLS] token, as shown in Figure 3a. We use the weighted cross-entropy loss, which helps in handling the imbalance in class sizes:

$$\mathcal{L} = -\sum_{i}\sum_{k} w_k \, y_{ik} \log(\hat{y}_{ik})$$

where $y_{ik}$ denotes the ground-truth label (it is 1 if $k$ is the true class label of the $i$-th example and 0 otherwise), $\hat{y}_{ik}$ is the corresponding model probability prediction, and $w_k$ is the weight for class $k$. We set $w_k$ as the size of the biggest class divided by the size of class $k$.
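As a concrete illustration of the class weighting, the following PyTorch snippet computes $w_k$ from hypothetical class counts (the numbers are placeholders, not figures from the dataset) and plugs them into a weighted cross-entropy loss:

```python
import torch
import torch.nn as nn

# Illustrative only: the class counts and their ordering
# (entailed, refuted, unknown) are assumptions, not dataset statistics.
class_counts = torch.tensor([2800.0, 1500.0, 900.0])
class_weights = class_counts.max() / class_counts  # w_k = |largest class| / |class k|

loss_fn = nn.CrossEntropyLoss(weight=class_weights)

# logits: [batch_size, 3] from the linear head on the [CLS] output;
# labels: [batch_size] with values in {0, 1, 2}
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = loss_fn(logits, labels)
```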
To analyze how transfer learning can improve performance, we compare the following approaches:

• TAPAS-stf: We use the publicly available TAPAS checkpoint which has been pre-trained with a masked language modeling objective, and fine-tune it on the SEM-TAB-FACTS dataset provided by the task organizers.
• TAPAS-tf: As a baseline, we directly use the publicly available TAPAS checkpoint, which had been fine-tuned on TABFACT without any further fine-tuning on SEM-TAB-FACTS. Since TABFACT has only entailed/refuted labels, this model is a binary classifier and does not predict the unknown class's probabilities.
• TAPAS-tf-stf: We use the publicly available TAPAS checkpoint which had been fine-tuned on TABFACT, and further fine-tune it on the SEM-TAB-FACTS dataset released by the task organizers. This is our submitted model for subtask A.

Subtask B: Evidence Finding
Pre-processing and Header Standardization We convert the multi-row/multi-column cells and standardize the header rows as discussed in Section 3.1. The relevant/irrelevant labels of the multi-row/multi-column cells are duplicated to all the single cells they span. We consider the relevant/irrelevant labels only for the cells of the non-header rows, as TAPAS does not make predictions for header cells. Based on the performance of header standardization in subtask A (which we will discuss in Section 5), we standardize headers for all our models in this subtask.
Model Our model takes the following input: [CLS] <statement> [SEP] <flattened table>, which is tokenized using the standard BERT tokenizer. We show the architecture of our model in Figure 3b. Our model computes token-level logits using a linear layer on top of each token's last hidden state output, which are used to compute cell-level logits by averaging the logits of the tokens in each cell (a code sketch of this aggregation is given after the list of fine-tuning strategies below). The probability of selection for each cell is calculated from the cell-level logits using the sigmoid function. We use the weighted binary cross-entropy loss, which helps in handling class imbalance:

$$\mathcal{L} = -\sum_{i} \left[ w_p \, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $y_i$ denotes the ground-truth label (it is 1 if the $i$-th token is part of any relevant cell and 0 otherwise), $\hat{y}_i$ is the corresponding model probability prediction, and $w_p$ denotes the weight of the positive (relevant) class. We set $w_p$ to 10.

Due to the similarity of evidence finding with table question-answering, we use the publicly available TAPAS checkpoint, which was fine-tuned in a chain on SQA, WikiSQL, and finally WTQ. We compare the following fine-tuning strategies:

• WTQ-base: As a baseline, we fine-tune our model directly for relevant cell selection on SEM-TAB-FACTS.
• WTQ-statement: We again fine-tune the model for relevant cell selection on SEM-TAB-FACTS, but we try to include the information on whether the statement was entailed/refuted by modelling the statement as 'Which cells entail "<statement>"?' or 'Which cells refute "<statement>"?'. <statement> denotes the original statement.
• WTQ-separate: We fine-tune two separate models, one which predicts the relevant cells for entailed statements and another one for refuted statements. This is our submitted system for subtask B.
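The sketch below (referenced in the model description above) illustrates the aggregation of token-level logits into cell-level logits and a weighted binary cross-entropy. It applies the loss at the cell level, which is one reasonable reading of the formulation; all names and data layouts are illustrative:

```python
import torch
import torch.nn.functional as F


def cell_selection_loss(token_logits, token_to_cell, labels, w_p=10.0):
    """Average token-level logits into cell-level logits, apply a sigmoid,
    and compute a weighted binary cross-entropy (illustrative sketch).

    token_logits:   [num_tokens] raw logits from the linear layer
    token_to_cell:  [num_tokens] long tensor, index of the cell each token belongs to
    labels:         [num_cells]  1 for relevant cells, 0 otherwise
    """
    num_cells = labels.shape[0]
    # Mean of token logits per cell.
    sums = torch.zeros(num_cells).index_add_(0, token_to_cell, token_logits)
    counts = torch.zeros(num_cells).index_add_(
        0, token_to_cell, torch.ones_like(token_logits))
    cell_logits = sums / counts.clamp(min=1.0)
    # Weighted binary cross-entropy with positive-class weight w_p.
    return F.binary_cross_entropy_with_logits(
        cell_logits, labels.float(), pos_weight=torch.tensor(w_p))
```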
During the post-evaluation phase, we experimented with the publicly available TAPAS checkpoint, which was fine-tuned on TABFACT. Similar to the systems described above, we compare three systems based on this checkpoint: TABFACT-base, TABFACT-statement, and TABFACT-separate.
Post-Processing We further apply post-processing steps to obtain the final prediction from the cell classification. To predict the relevant header cells, we mark as relevant the header cells of any column that contains at least one cell predicted as relevant. We label multi-row/multi-column cells as relevant if any of the single cells they span are predicted as relevant.
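A minimal sketch of these two post-processing steps, under assumed data structures (cell positions as (row, column) pairs and a mapping from original cells to the expanded positions they span):

```python
def post_process(relevant, header_rows, span_map):
    """Post-processing sketch; names and data layout are assumptions.

    relevant:    set of (row, col) positions predicted relevant in the
                 expanded, non-header part of the table
    header_rows: number of header rows in the standardized table
    span_map:    maps each original (possibly multi-row/column) cell to the
                 list of expanded positions it spans
    """
    # Header cells of any column containing a relevant cell are relevant.
    relevant_cols = {c for (_, c) in relevant}
    for r in range(header_rows):
        relevant |= {(r, c) for c in relevant_cols}
    # An original spanned cell is relevant if any position it covers is.
    return {cell for cell, positions in span_map.items()
            if any(pos in relevant for pos in positions)}
```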

Data Description
We used the dataset provided by the task organizers for both subtasks. We did not use the table metadata in our systems.
For subtask A, dataset statistics and the official splits are shown in Table 2a. The provided training sets do not have any statements of the unknown class. So, we used the manually annotated training set to create a training set with unknown statements. Each statement of the manually annotated training set was added as an unknown statement to a different table chosen randomly. We used this dataset for training all our models for subtask A.
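The construction of the unknown examples can be sketched as follows (field names are assumptions, not the dataset's actual schema):

```python
import random


def make_unknown_examples(examples):
    """Pair each statement with a randomly chosen *different* table and
    label it 'unknown' (sketch of the augmentation described above)."""
    table_ids = [ex["table_id"] for ex in examples]
    unknown = []
    for ex in examples:
        other = random.choice([t for t in table_ids if t != ex["table_id"]])
        unknown.append({"table_id": other,
                        "statement": ex["statement"],
                        "label": "unknown"})
    return unknown
```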
For subtask B, dataset statistics and the official splits are shown in Table 2b. We use the auto-generated training set for training all our models in subtask B.

Implementation
For the implementation of our systems, we used the HuggingFace Transformers library (Wolf et al., 2020) and the AdamW optimizer available in PyTorch (Paszke et al., 2019) with the default parameters (learning rates are specified below). All models were fine-tuned using a single Nvidia GeForce RTX 2080 Ti GPU.
We used the base variant of TAPAS, which has a hidden dimension of 768, in all our models. For subtask A, we first fine-tuned the classifier head with the TAPAS layers frozen for 3 epochs with a learning rate of 1e-5, and then fine-tuned the whole model for 10 epochs with a learning rate of 1e-6. We used a batch size of 8. We saved a checkpoint every 100 steps and selected the best checkpoint based on the validation set performance.
For subtask B, we fine-tuned the whole model for 5000 steps with a learning rate of 1e-6. We used a batch size of 8. We saved a checkpoint every 50 steps and selected the best checkpoint based on the validation set performance.
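The two-stage fine-tuning for subtask A can be sketched with HuggingFace's TAPAS classes as below; the checkpoint name, the num_labels override, and the use of ignore_mismatched_sizes are plausible choices for illustration rather than the exact configuration used here:

```python
import torch
from transformers import TapasForSequenceClassification

# Load a TAPAS classifier; the TABFACT-fine-tuned checkpoint is one plausible
# starting point (assumption), re-headed for three-way classification.
model = TapasForSequenceClassification.from_pretrained(
    "google/tapas-base-finetuned-tabfact",
    num_labels=3,
    ignore_mismatched_sizes=True)

# Stage 1: freeze the TAPAS encoder and train only the classification head.
for param in model.tapas.parameters():
    param.requires_grad = False
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5)
# ... train for 3 epochs ...

# Stage 2: unfreeze everything and fine-tune the whole model.
for param in model.tapas.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
# ... train for 10 epochs, checkpointing every 100 steps ...
```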

Evaluation Metrics
In subtask A, two evaluation metrics are used. The first is the standard F1-micro score for three-way classification. The second again calculates the F1-micro score but excludes statements whose ground-truth label is unknown from the evaluation; however, classifying entailed/refuted statements as unknown is still penalized.
In subtask B, the evaluation metric used is the standard F1 score with relevant cells as the positive class. If multiple minimal sets of cells can be used to determine the statement's truth value, the dataset contains all of these versions. The score for that statement is calculated by comparing the prediction against each ground truth version and considering the highest score.
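For one statement, the best-version matching can be computed as in this sketch (cell positions as (row, column) pairs; names are illustrative):

```python
def statement_f1(predicted, gold_versions):
    """F1 for one statement, taking the best score over all ground-truth
    versions of the minimal relevant-cell set.

    predicted:     set of (row, col) cells predicted relevant
    gold_versions: list of sets, each a valid set of relevant cells
    """
    best = 0.0
    for gold in gold_versions:
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        best = max(best, f1)
    return best
```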

Results
Subtask A The performance of the various systems we considered in subtask A is shown in Table 3. Header standardization improves the performance of all the systems we compared. Transfer learning from TABFACT also improves the performance of our systems. Surprisingly, TAPAS-tf without any fine-tuning on SEM-TAB-FACTS has a better two-way F1-micro score than TAPAS-stf. This shows the potential of transfer learning from TABFACT in subtask A. From the confusion matrix shown in Figure 4a, we observe that our model struggles with the unknown class and often misclassifies it as refuted.
Subtask B The performance of the various systems we considered in subtask B is shown in Table 4. Modifying the statement to include entailed/refuted class information leads to a small drop in performance for the models fine-tuned on question-answering earlier, and to a small increase in performance for the models fine-tuned on TABFACT. Separate models for entailed/refuted statements perform the best among the systems we considered: they significantly improve the performance on entailed statements, with a small drop in performance on refuted statements. Surprisingly, we observe that transfer learning from TABFACT performs better than transfer learning from WTQ, even though evidence finding is a cell selection task. We believe this is because the model has to predict the cells that can be used as evidence for table entailment, and the token-level embeddings of the model fine-tuned on TABFACT are better suited for this than those of the model fine-tuned on WTQ, which is instead a question-answering dataset.

Figure 4: Confusion matrices of the test set predictions by our best model for each subtask. The percentages show the ratio of the target class which was predicted as that class.
Long Inputs The maximum number of tokens supported by our system is 512. For inputs longer than 512 tokens, the table is truncated row by row until it fits within 512 tokens. We compare our system's performance on these long sequences and on sequences that fit within 512 tokens. The results are shown in Table 5. We find a significant drop in performance on sequences longer than 512 tokens, which had to be truncated.
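The row-by-row truncation can be sketched as follows, assuming the statement and each table row have already been tokenized and that two special tokens ([CLS] and [SEP]) are counted as a fixed overhead:

```python
def truncate_rows_to_fit(statement_tokens, row_token_lists, max_len=512):
    """Drop table rows from the bottom until the statement plus the
    remaining rows fit within max_len tokens (illustrative sketch)."""
    budget = max_len - len(statement_tokens) - 2  # reserve [CLS] and [SEP]
    kept = []
    for row in row_token_lists:
        if len(row) > budget:
            break
        kept.append(row)
        budget -= len(row)
    return kept
```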

Conclusion
In this paper, we presented our approach for fact verification and evidence finding for tabular data in scientific documents. We show that transfer learning from TABFACT and standardization of the tables to have a single header row help improve our system's performance. We also show that having separate evidence finding models for entailed/refuted statements helps improve our system's performance in the second subtask.
We also find that our model has a significant drop in performance on large tables, since they are truncated to fit within 512 tokens, the maximum number of tokens supported by TAPAS.
In future work, we would like to experiment with table pruning methods like heuristic entity linking or heuristic exact match so that the statement and table can fit within 512 tokens. Our systems did not use the table metadata while making the predictions. In the future, we would also like to explore extending the model to encode the table metadata along with the table.