AttesTable at SemEval-2021 Task 9: Extending Statement Verification with Tables for Unknown Class, and Semantic Evidence Finding

This paper describes our approach for Task 9 of SemEval 2021: Statement Verification and Evidence Finding with Tables. We participated in both subtasks, namely statement verification and evidence finding. For the subtask of statement verification, we extend the TAPAS model to adapt to the ‘unknown’ class of statements by finetuning it on an augmented version of the task data. For the subtask of evidence finding, we finetune the DistilBERT model in a Siamese setting.


Introduction
Tables provide a compact and structured way of presenting information. They are easily interpretable by humans and are widely used in scientific papers, business articles, and even government reports. At the same time, it is easy to misinterpret tabular data, and even to use it maliciously to spread misinformation. Thus, systems that can verify facts from tables and locate evidence in them can go a long way in ameliorating issues related to table interpretation. Such systems can also be used for question answering over large tables that are difficult to analyze manually, since the task of fact verification with evidence finding is closely related to that of question answering (Jobanputra, 2019).
Table entailment refers to the task of determining whether a statement is refuted or supported by a given table. Traditionally, it has been treated as a binary classification task. However, there are instances where the table contains insufficient information to either support or refute the given statement, meaning the statement is "unknown" with respect to the table.
In this paper, we present a table entailment setup that extends to three classes: refuted, entailed, and unknown, as part of the Sem-Tab-Fact task (Wang et al., 2021). The task is divided into two parts: statement verification and evidence finding. Given a table and a statement, one must first determine whether the table entails the statement, refutes it, or contains insufficient information about it. This forms the subtask of statement verification. If the table entails or refutes the statement, one would also like to know which cells in the table provide evidence for that decision. This is the subtask of evidence finding, which can be formulated as a binary classification problem: each cell in the table is assigned the label 'relevant' or 'irrelevant' depending on whether it provides evidence for the given statement.
Recent years have seen a rise in the use of transfer learning in language processing (Malte and Ratadiya, 2019a) owing to its superior performance. Models based on underlying concepts like the attention mechanism and transformers are seeing widespread use across a range of tasks (Malte and Ratadiya, 2019b; Ratadiya and Mishra, 2019). Our approach follows this trend, as we use similar architectures as the fundamental building blocks of our systems for both subtasks. For the subtask of statement verification, we modify the recently released TAPAS model, which is pre-trained on the TabFact dataset. The pre-trained model is trained on only two classes, 'entailed' and 'refuted', and is not capable of classifying 'unknown' samples. The training data provided for our task also contains only these two labels. To adapt to 'unknown' statements, we augment the given data by including random statements from other tables as unknown. We then fine-tune the TAPAS model, extended to three classes, on this augmented data.
Using this approach, we achieve a three-way F1 score of 65.59 and a two-way F1 score of 71.72 on the test data. We ranked 8th on the official leaderboard for this subtask.
For the subtask of evidence finding, we finetune the DistilBERT model (Sanh et al., 2019) in a Siamese setup (Reimers and Gurevych, 2019). The model is pre-trained on the SNLI dataset (Bowman et al., 2015), the MultiNLI dataset (Williams et al., 2018) and the STS benchmark (Cer et al., 2017). For finetuning, we use the Contrastive loss (Hadsell et al., 2006) as our loss function. For making a prediction, we calculate the cosine similarity between the embeddings computed by the model for the statement and the cell value. If the cosine similarity crosses a set threshold, we consider the cell to be relevant to the given statement. Using this approach, we achieve an F1-score of 43.02 on the test data and ranked 7th on the official leaderboard for this subtask.
The source code of our proposed approach for both the subtasks has been made public 1 to encourage usage and improve the reproducibility of the results.

Related Work
Recently, the TabFact dataset was released, which contains 118K human-annotated statements related to about 16K Wikipedia tables. These statements were annotated with only two labels, 'entailed' and 'refuted', leaving out the 'unknown' cases. A system's ability to distinguish 'unknown' statements from 'entailed' and 'refuted' ones is quite critical, as one may deliberately craft seemingly believable statements that are actually 'unknown'.
Recently, there has been work in areas related to table fact verification, especially since the release of the TabFact dataset. Many of these approaches are graph-based in nature (Shi et al., 2020). The current state-of-the-art on the TabFact dataset is the recently released TAPAS model, which outperforms its predecessor by approximately 6%. Unfortunately, all of these models are quite far from human-level performance, suggesting ample scope for improvement. Little to no work has been done previously on the task of evidence finding, with the closest approaches being related to table-based question answering, like WikiTableQuestions (Pasupat and Liang, 2015), WikiSQL (Zhong et al., 2018) and SQA (Iyyer et al., 2017), where getting the answer to a question can be considered analogous to getting evidence for a statement.

Data and Preprocessing
The training dataset for this task contains two kinds of data:
• Manual annotations: These are crowdsourced from human annotators and validated in a second round of validation.
• Auto-generated annotations: These statements are auto-generated using a random paraphraser and a table understanding service (Zheng et al., 2020).
A separate development set was also provided.

Subtask A: Statement Verification
The provided dataset for this task is in XML format. We use a custom parser to convert the data into CSV format, such that each data sample is of the form (table, statement, label), where the table is itself a CSV extracted from the corresponding XML file. No preprocessing is applied to the statements. In general, table cells can span multiple rows and columns. To simplify things, we treat a cell spanning multiple columns/rows as multiple cells with the same value. This preserves the logical hierarchy in the table while keeping the table structure simple. A table may also have additional metadata accompanying it, like a legend, a caption, and a footer. Our final model does not use this metadata, since we observe in Section 5.1.2 that including the metadata has a small negative impact on model performance.
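The span-expansion step can be sketched as follows. This is an illustrative helper, not the authors' actual parser; the (row, col, rowspan, colspan, text) cell representation is assumed for the example.

```python
# Sketch of the span-expansion preprocessing: a cell spanning multiple
# rows/columns is treated as multiple cells with the same value.
def expand_spans(cells, n_rows, n_cols):
    """cells: list of (row, col, rowspan, colspan, text) tuples.
    Returns a rectangular n_rows x n_cols grid of cell texts."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text  # duplicate the value into every covered slot
    return grid
```

For instance, a header cell spanning two columns is duplicated into both column positions, so the grid stays rectangular while the header's scope is preserved.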
After the above preprocessing, we have 179,345 auto-generated and 4,506 manual data samples.

Subtask B: Evidence Finding
For this subtask, we prepare the data in the form (cell text, statement text, label), where the label can be 'relevant' or 'irrelevant'. We do not encode the table's structural information and focus only on the semantic similarity between the statements and the cell texts.
An important point to note here is that only a few of the cells in the table actually contribute to the evidence for each statement. The average ratio of relevant samples to total samples per statement, as observed in the development set, is around 0.097. This causes a substantial class imbalance, since a meager 9.7% of the total samples are 'relevant'. To address this problem, we undersample the 'irrelevant' class by removing some irrelevant samples for each statement from the data. We try three ratios for undersampling and compare the percentage of relevant statements after undersampling in Table 1. In the original data, the average ratio of relevant to total samples is around 9.7%, so we would want our data to reflect a similar ratio. Therefore, we use the data undersampled using the first approach for further analysis. The total number of samples after undersampling stands at 10.24M.
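The per-statement undersampling described above could be sketched as follows. The function name, the cap of nine irrelevant cells per statement (chosen so that roughly one sample in ten is 'relevant', matching the ~9.7% ratio), and the tuple layout are our assumptions for illustration.

```python
import random

# Sketch of per-statement undersampling: keep every 'relevant' cell and a
# capped random subset of 'irrelevant' ones to rebalance the classes.
def undersample(samples, keep_irrelevant=9, seed=0):
    """samples: list of (cell_text, statement_text, label) for ONE statement,
    with label in {'relevant', 'irrelevant'}; returns a rebalanced list."""
    rng = random.Random(seed)
    relevant = [s for s in samples if s[2] == "relevant"]
    irrelevant = [s for s in samples if s[2] == "irrelevant"]
    rng.shuffle(irrelevant)  # drop a random subset, not a biased one
    return relevant + irrelevant[:keep_irrelevant]
```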

Data Augmentation
The given data has no 'unknown' statements; thus, we augment the data to introduce samples of this class label. Let S denote the set of all statements in our data, S_e the set of all entailed statements, and S_r the set of all refuted statements. Then S = S_e ∪ S_r.
For a given table t, let S_t denote the set of all statements associated with that table. We first calculate n_e (the number of entailed statements for that table) and n_r (the number of refuted statements for that table). We then calculate n_u (the number of unknown statements to add) as follows:

n_u = max(min(n_e, n_r), 1)    (1)

After this, we randomly select n_u statements from S \ S_t. This forms the set of 'unknown' statements for the table t. We do this for all tables in the data.
The above augmentation scheme ensures that the resultant augmented data has an almost equal overall distribution among the three classes. It also ensures that evenness across tables is maintained, as sufficient unknown statements (at least one) are added for each table, and not just overall. After performing the above augmentation scheme, we get 85,296 unknown statements for the auto-generated data and 1,637 unknown statements for the manual data.
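The augmentation scheme of Equation (1) can be sketched as below. The function name and the dict-based data layout are assumptions for illustration; the core logic (computing n_u per table and sampling from S \ S_t) follows the description above.

```python
import random

# Sketch of the 'unknown' augmentation: for each table t, add
# n_u = max(min(n_e, n_r), 1) statements sampled from other tables.
def augment_unknown(statements_by_table, seed=0):
    """statements_by_table: dict of table_id -> list of (statement, label),
    label in {'entailed', 'refuted'}. Returns table_id -> added 'unknown' statements."""
    rng = random.Random(seed)
    all_statements = {s for stmts in statements_by_table.values() for s, _ in stmts}
    unknown = {}
    for table_id, stmts in statements_by_table.items():
        n_e = sum(1 for _, lab in stmts if lab == "entailed")
        n_r = sum(1 for _, lab in stmts if lab == "refuted")
        n_u = max(min(n_e, n_r), 1)          # Equation (1)
        own = {s for s, _ in stmts}
        pool = sorted(all_statements - own)  # S \ S_t
        unknown[table_id] = rng.sample(pool, min(n_u, len(pool)))
    return unknown
```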

Subtask A: The TAPAS Model
TAPAS (Herzig et al., 2020) is a recently released BERT (Devlin et al., 2019) based model that encodes structural information via column, row and rank embeddings. Eisenschlos et al. extended TAPAS to binary entailment using the TabFact dataset. We use the base variant (12 encoder blocks) of this TAPAS model, which has been pre-trained on the TabFact dataset. We prefer the base variant over the large variant (24 encoder blocks) due to limitations on computation power and due to the superior performance of the base variant, as shown in Section 5.1.2.
To extend the TAPAS model to support the unknown class, we replace the two-neuron linear output layer in the pre-trained model with a randomly initialized linear layer consisting of three neurons. We then finetune this modified model on the augmented data containing three classes. Figure 2 shows an overview of the modified model. It consists of an embedding layer that concatenates the positional and semantic information of a cell value, the encoder block consisting of transformer layers, and subsequent layers required for the classification output. The core layers of the modified model are in accordance with the original TAPAS model (Herzig et al., 2020).
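The head replacement can be sketched as follows. In practice the pre-trained model would be loaded via Hugging Face's Transformers library (see Section on Libraries and Tools); the DummyTapas class below is only a stand-in so the sketch is self-contained, and the attribute name `classifier` mirrors the HF model's output layer.

```python
import torch
import torch.nn as nn

# Sketch of extending a 2-class TAPAS head to 3 classes: swap the pre-trained
# two-neuron output layer for a freshly initialized three-neuron one.
def extend_to_three_classes(model, hidden_size=768):
    model.classifier = nn.Linear(hidden_size, 3)  # randomly initialized head
    return model

# Stand-in for the pre-trained model, for illustration only:
class DummyTapas(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, hidden_size)  # placeholder encoder
        self.classifier = nn.Linear(hidden_size, 2)         # binary entailment head

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))
```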

Subtask B: Siamese DistilBERT
Figure 2: The modified TAPAS architecture. 'Classifier' is just a linear layer with three neurons.

For the evidence finding subtask, we finetune the DistilBERT model on the undersampled data. The model was pre-trained with mean pooling on the SNLI dataset, the MultiNLI dataset, and the STS benchmark. Finetuning is performed in a Siamese setting using the Contrastive loss as the loss function.
Let s denote a statement, c a cell, and r the relevance label (either 0 or 1), and let M(x) denote the embedding computed by the model for an input x. Then the distance d and the Contrastive loss L(d, r) are defined as follows:

d = 1 − cos(M(s), M(c))
L(d, r) = r · d² + (1 − r) · max(m − d, 0)²    (2)

Here, m denotes the 'margin' value, which ensures that dissimilar pairs contribute to the loss only if their distance is within this margin. For making inferences, we first use our finetuned model to compute the sentence embeddings for the statement and the cell text, and then compute the cosine similarity between the two. Hyperparameter tuning and other details about the models are discussed in Section 4.
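A minimal sketch of the loss in Equation (2), assuming the cosine-distance form d = 1 − cos(M(s), M(c)) (consistent with the cosine-similarity comparison used at inference time here):

```python
import torch
import torch.nn.functional as F

# Sketch of the Contrastive loss (Hadsell et al., 2006) over sentence embeddings.
def contrastive_loss(emb_s, emb_c, r, m=0.5):
    """emb_s, emb_c: (batch, dim) embeddings; r: (batch,) labels in {0, 1},
    1 = relevant. Dissimilar pairs contribute only within the margin m."""
    d = 1.0 - F.cosine_similarity(emb_s, emb_c)          # cosine distance
    loss = r * d.pow(2) + (1 - r) * torch.clamp(m - d, min=0).pow(2)
    return loss.mean()
```

Note that for an identical pair with r = 0, the loss equals m², so the margin directly controls how hard dissimilar pairs are pushed apart.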

Subtask A
For training, we first merge the auto-generated and manual augmented data. We then perform an 80-20 split on this merged data to get our training and validation sets. We then fine-tune on the training data and validate on the validation set.
The 'TAPAS Encoder' (Figure 2) consists of a stack of 12 encoder blocks (similar to the BERT Base). For finetuning, we freeze the entire model, except the last three encoder blocks, the 'TAPAS Pooler', and the final classification layer.
Categorical Cross-Entropy is used as the loss function. We use a batch size of 32 and finetune for 3 epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 5 × 10⁻⁵.
Before making the final submission, the training and development sets are merged together and the model is trained for one more epoch.

Subtask B
We take a randomly selected sample of 500K pairs from the undersampled data and perform an 80-20 split on it to get our training and validation sets. We keep the margin m at 0.5 and finetune the entire model, with no parameters frozen, for a single epoch using a batch size of 64. For making inferences from the cosine similarity, we use an empirically selected threshold of 0.3.
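The thresholded inference rule can be sketched as follows (the function name is ours; the 0.3 threshold is the empirically selected value stated above):

```python
import torch
import torch.nn.functional as F

# Sketch of inference: a cell is 'relevant' when the cosine similarity between
# the statement and cell embeddings reaches the selected threshold.
def predict_relevance(emb_statement, emb_cell, threshold=0.3):
    sim = F.cosine_similarity(emb_statement, emb_cell, dim=-1)
    return sim >= threshold  # True -> 'relevant', False -> 'irrelevant'
```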

Libraries and Tools
Google Colab 2 was used to perform all the experiments. The time taken per epoch for any model did not exceed 10 hours. The GPUs automatically allotted by Colab varied between Tesla T4, Tesla P100-PCIE-16GB, and Tesla K80. PyTorch 3 is used as the central framework for both tasks. For subtask A, Huggingface's Transformers library 4 was used to load the pre-trained TAPAS model. For subtask B, we use the SentenceTransformers framework 5 to load the pre-trained Siamese DistilBERT model.

Official Metrics
The original pre-trained TAPAS (Base) model, without any finetuning, achieves a two-way F1 score of 62.56. Our final finetuned model achieves a three-way F1 score of 65.59 and a two-way F1 score of 71.72. Figure 3 shows the corresponding confusion matrix. We ranked 8th on the official leaderboard for subtask-A.
From the confusion matrix, we observe that the most confusing cases for our model are true 'unknown' statements, which get classified as 'entailed'. On manual analysis of such statements, we observe that many of them have considerable textual overlap with the table. This misclassification seems to be a consequence of our augmentation scheme: the statements we add as 'unknown' during augmentation have almost no textual overlap with the table data, since they were sourced from other tables. In general, however, as observed in the test data, unknown statements can broadly be divided into two kinds: 'related' and 'unrelated'. The former are unknown statements that are related in a semantic or directly textual way to the table's data, while the latter are not related to the table's data in any way. The following example should make this clear. For Table 2, a 'related' unknown statement would be "A total of 10L of silage, straw, peat or combo was distributed"; the boldfaced words overlap directly with the table's contents. An 'unrelated' unknown statement could be "Earth revolves around the Sun". Both of these sub-classes of the 'unknown' class are important. Synthesizing 'unrelated' unknown statements is a simple task, whereas synthesizing 'related' statements without any additional information seems to be fairly non-trivial.
We also observe that wrongly classified statements are, on average, 8.8 characters longer than correctly classified ones; Figure 4 shows the corresponding histogram.
We further observe that including the metadata worsens the accuracy. A possible reason could be our way of including the metadata as rows. There may be better ways of doing this, which may further improve the accuracy.

Effect of Model Size
We try out two variants of TAPAS: Base and Large. The Base variant has 12 encoder blocks and the Large variant has 24, similar to BERT Base and BERT Large. On the TabFact dataset, the Base variant is only about 0.24 points behind the Large one. The following tests are performed on the original, unaugmented data, containing two classes 6. We merge the original manual and auto-generated statements into a single dataset and perform an 80-20 split.
Variant   Validation Accuracy
Base      0.81
Large     0.71

Table 4: Effect of finetuning a larger model

Subtask B
Our final model achieves an F1 score of 43.02. Figure 5 shows the corresponding confusion matrix. As evident from the confusion matrix, there is a lot of room for improvement, especially in handling the 'relevant' statements. We ranked 7th on the official leaderboard for subtask-B.

Conclusion
Thus, we have presented a statement verification and evidence finding setup for tables. For subtask-A, we extended the TAPAS model to adapt to the 'unknown' statements. For subtask-B, we used a semantic approach for evidence finding. Our results for subtask-A highlight the problems encountered while generating and working with 'unknown' statements. For subtask-B, our results show the effect of taking only the semantic information into account. An important future prospect for subtask-A would be to find a more effective way of generating the 'unknown' statements. For subtask-B, utilizing the table's available structural information to improve the results seems to be a promising prospect.

6 Training the Large variant on augmented data exceeds the time limit of Google Colab (12 hours).