Structural Encoding and Pre-training Matter: Adapting BERT for Table-Based Fact Verification

Growing concern with online misinformation has encouraged NLP research on fact verification. Since writers often base their assertions on structured data, we focus here on verifying textual statements given evidence in tables. Starting from the Table Parsing (TAPAS) model developed for question answering (Herzig et al., 2020), we find that modeling table structure improves a language model pre-trained on unstructured text. Pre-training language models on English Wikipedia table data further improves performance. Pre-training on a question answering task with column-level cell rank information achieves the best performance. With improved pre-training and cell embeddings, this approach outperforms the state-of-the-art Numerically-aware Graph Neural Network table fact verification model (GNN-TabFact), increasing statement classification accuracy from 72.2% to 73.9% even without modeling numerical information. Incorporating numerical information with cell rankings and pre-training on a question-answering task increases accuracy to 76%. We further analyze accuracy on statements involving single rows or multiple rows and columns of tables, on different numerical reasoning subtasks, and on generalizing to detecting errors in statements derived from the ToTTo table-to-text generation dataset.


Introduction
The rapid growth in the amount and sources of online textual content has raised concerns about misinformation and its potential harmful impacts on society when quickly spread to a massive audience. For example, a 2015 study on enhancing medical education with Wikipedia found that 97% of the medical students completing the survey reported finding mistakes in Wikipedia medical entries (Herbert et al., 2015). Concerns about misinformation have stimulated extensive research on automatic fact verification, i.e., verifying whether a given textual statement is entailed or refuted by given evidence. While most existing fact verification work focuses on unstructured textual evidence, fact verification with structured evidence is still underexplored. Recently, Chen et al. (2019) introduced a new large-scale dataset, TabFact, for verifying statements based on structured evidence in tables. Figure 1 presents an example table and the corresponding entailed and refuted statements from TabFact. The task of fact verification based on structured evidence is challenging in two respects. First, traditional language models trained on unstructured text are not directly applicable to learning representations for structured text. It is difficult for such language models to understand a sentence like "Round Pick Player College 1 1 Ralph Sampson Virginia ..." produced by directly concatenating the table cells in Figure 1. Second, detecting misinformation with structured evidence involves not only linguistic inference but also numerical reasoning, such as addition, subtraction, sorting, and counting over records. For example, to verify the statement "Ralph Sampson was two picks ahead of Rodney Mccray", we first need to find the order in which each player was picked and then subtract the two.
Table representation learning is important for utilizing table data as evidence for fact verification. Most existing methods apply the BERT model (Devlin et al., 2018) to learn table representations. Table-BERT (Chen et al., 2019) uses simple templates to transform tables into "somewhat natural" sentences and fine-tunes BERT on pairs of statement sentences and the corresponding table "sentences". However, this approach adds many extra tokens to the tables, sometimes doubling the length of the original table token sequence. Zhong et al. (2020) propose to first derive logical forms from the table and the statement. A heterogeneous graph is then constructed to capture connections among table cells, functions, arguments, and statement tokens. A graph-enhanced contextual representation is learned for each token by attending only to neighboring nodes in the graph when applying the BERT model. However, as we show in §4.2, BERT is less effective when it is pre-trained on unstructured data but applied to structured data such as tables.
Table-based fact verification also requires numerical reasoning over table records. Chen et al. (2019) propose a Latent Program Algorithm (LPA), in which statements are parsed into potential programs and a weakly supervised discriminator is trained to assign a confidence score to each program. The outputs of the latent programs are aggregated or ranked according to their confidence scores to produce the final prediction. Zhong et al. (2020) propose to learn a representation for the program with a program-driven neural module network, where semantic compositionality is dynamically modeled along the program parsing structure in a bottom-up style. The program representation and the token representations for the table, statement, and linearized program are then combined to make the final prediction. However, these two models depend on weakly supervised labels, which can be noisy, to derive potential programs. Recently, the authors of Table-BERT released a new model, GNN-TabFact, which applies Numerically-aware Graph Neural Networks (NumGNN) on top of Table-BERT to learn to compare numerical cells in the same column. Nonetheless, it requires constructing a graph neural network and conducting iterative message passing among the nodes in the graph.
We propose to adapt the Table Parsing (TAPAS) model (Herzig et al., 2020), which has proven effective in question answering over tables, to model tables for fact verification. TAPAS concatenates all table cells without adding extra tokens and then extends BERT's architecture with additional embedding layers to capture the table structure and numerical comparison information for each token in the table. We replace its original two top classification layers for answer generation with a single classification layer on the [CLS] token to classify whether a given statement is entailed or refuted by the table. Our experimental results show that with proper pre-training, TAPAS outperforms the state-of-the-art GNN-TabFact model, increasing the accuracy of statement classification from 72.2% to 73.9% on the TabFact test set even without modeling numerical information. By further adding ranking information for numerical columns and pre-training on the question answering task, TAPAS achieves 76% accuracy, with 89% on the simple statements and 69.8% on the complex statements. We also perform further analysis to examine: (1) how the numerical comparison information improves TAPAS's performance when ranking information is needed; (2) how the complexity of the training set affects model performance on simple and complex statements; and (3) how well systems trained on TabFact generalize to other fact verification tasks. This paper's primary contributions are: (1) exploring the effect of table structure modeling on fact verification; (2) measuring the importance of language model pre-training on tabular data; and (3) analyzing the performance of fact-verification models on different numerical reasoning subtasks and errors.

Related Work
Fact Verification Thorne et al. (2018) introduce a new dataset for fact extraction and verification (FEVER), with claims generated by altering sentences extracted from Wikipedia. Nie et al. (2019) propose a neural semantic matching network based on a bidirectional LSTM to retrieve related evidence and predict whether the claim is entailed, neutral, or contradicted by the evidence. Soleimani et al. (2019) adopt a pre-trained BERT model for evidence retrieval and fine-tune it for evidence-claim relation prediction. Jobanputra (2019) proposes an unsupervised question-answering-based approach to fact checking that first generates questions for a claim and then predicts the masked span, which is compared with the ground-truth answer for label classification. Zhong et al. (2020) construct two graphs for evidence and claim via semantic role labeling, and use graph-based reasoning for fact checking.
Structured Data Modeling Modeling structured data is essential for multiple tasks, e.g., table classification, table population, table retrieval, question answering, data-to-text generation, and table-based fact verification.
Ghasemi-Gol and Szekely (2018), Trabelsi et al. (2019), and Deng et al. (2019) propose to embed tabular data into a vector space using table structure, in order to classify tables into different categories, to populate tables with additional rows or columns, and to retrieve tables given query keywords. Nishida et al. (2017) employ RNNs to encode cell content and CNNs to encode table structure to better capture semantic features for table classification. Haug et al. (2018) propose to generate candidate logical forms from a question, convert the logical forms to paraphrases, and rank them according to their similarity to the original question in order to produce the answer. Herzig et al. (2020) add additional embedding layers to a BERT model to capture table structure and numerical information, and add two classification layers to predict aggregation functions and corresponding table cells to generate an answer for a given question.
Much current table-to-text generation work focuses on generating biographies from Wikipedia infoboxes (Lebret et al., 2016; Bao et al., 2018) or summarizing basketball games according to box- and line-score tables (Wiseman et al., 2017; Puduppully et al., 2019). Parikh et al. (2020) release a dataset for controlled table-to-text generation, in which table cells are highlighted for the target sentences. Other recent work proposes a natural language generation task in which the model is tasked with generating statements that are entailed by a given table.
Chen et al. (2019) introduce a dataset for fact verification given tabular data as evidence and propose a BERT-based Table-BERT model and a latent program algorithm (LPA) model for this task. Zhong et al. (2020) propose to first derive a program from the table and statement, and then learn graph-enhanced contextual representations for both the table tokens and the program to classify statements. After this paper was submitted, four related papers were published that explore additional questions in table-based fact verification. One proposes to utilize masking in the self-attention layer to model table structure; Shi et al. (2020) and another explore how to effectively combine linguistic and symbolic information for table-based fact verification; and a fourth generates synthetic datasets to pre-train a TAPAS model to better understand tables for downstream tasks such as table-based fact verification and question answering.

Methods
We describe adapting the TAPAS model to fact verification over tables and then introduce the different pre-trained models on which we fine-tune TAPAS for the table verification task.

TAPAS for table-based fact verification Similar to the Table-BERT model, TAPAS also flattens the input table into a sequence of tokens. It concatenates all the tokenized table cells in a row and then concatenates all the row sequences into a table sequence. In addition to the position and segment embeddings used in Transformer language models, four additional position embedding layers are introduced to encode the table structure and the numerical comparison relations among cells in each column. For each token in the table, a column index and a row index indicate the position of the token in the table. For numerical and date columns (as determined by pandas), table cells are sorted to generate a rank index and an inverse rank index for each token according to its position in the sorted column. The sequence of tokens in the statement to be verified is concatenated with the corresponding table token sequence, with the [SEP] token indicating where the table sequence starts. For the statement sequence, the row, column, rank, and inverse rank indexes are all set to 0. A special [CLS] token is added before the entire input sequence for classification purposes. Figure 2 shows how the table and statement from Figure 1 are indexed for each embedding layer. We remove the top two classification layers used in the original TAPAS model for the question answering task and use the pooled output of the [CLS] token for statement classification. To explore the impact of different table encodings, we experiment with two variants of TAPAS: TAPAS-Row-Col, which only utilizes the column and row index embeddings, and TAPAS-Row-Col-Rank, which additionally leverages the ranking and inverse ranking information for numerical columns. A sketch of this linearization and indexing scheme is shown below.
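To make the indexing concrete, the following sketch (plain Python, whitespace tokenization, and a hypothetical linearize helper; not the TAPAS implementation, which uses WordPiece tokens and also an inverse rank index) shows how a statement and table might be flattened into one sequence with per-token row, column, and rank indices.

def linearize(statement, header, rows):
    """Hypothetical helper: flatten a statement and a table into one token sequence
    with per-token row, column, and rank indices (0 = not applicable)."""
    tokens, row_ids, col_ids, rank_ids = [], [], [], []

    def add(tok, r=0, c=0, k=0):
        tokens.append(tok); row_ids.append(r); col_ids.append(c); rank_ids.append(k)

    # Rank cells within each numeric column (1 = smallest value); 0 otherwise.
    def column_ranks(col):
        values = [row[col] for row in rows]
        try:
            order = sorted(range(len(values)), key=lambda i: float(values[i]))
            return {i: r + 1 for r, i in enumerate(order)}
        except ValueError:                       # non-numeric column
            return {i: 0 for i in range(len(values))}
    rank_maps = [column_ranks(c) for c in range(len(header))]

    add("[CLS]")
    for tok in statement.lower().split():        # statement tokens: all indices are 0
        add(tok)
    add("[SEP]")
    for c, cell in enumerate(header):            # header tokens: row index 0
        for tok in str(cell).lower().split():
            add(tok, r=0, c=c + 1)
    for r, row in enumerate(rows):               # body cells: 1-based row/column indices
        for c, cell in enumerate(row):
            for tok in str(cell).lower().split():
                add(tok, r=r + 1, c=c + 1, k=rank_maps[c][r])
    return tokens, row_ids, col_ids, rank_ids

# Example in the spirit of Figure 1 (invented values, for illustration only).
toks, rids, cids, kids = linearize(
    "ralph sampson was pick 1",
    ["pick", "player", "college"],
    [["1", "ralph sampson", "virginia"], ["3", "rodney mccray", "louisville"]],
)
print(list(zip(toks, rids, cids, kids))[:8])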
Pre-training We fine-tune the adapted TAPAS model for the table verification task starting from three different pre-trained models. The first pre-trained model is the widely applied BERT model (Devlin et al., 2018) trained on the BooksCorpus and Wikipedia text. Lists and tables are removed from the training text, so this model is trained only on unstructured data. The second pre-trained model is the TAPAS-Row-Col-Rank model pre-trained by Herzig et al. (2020) on a large number of Wikipedia text-table pairs with a masked language modeling task. This pre-trained model should be better able to understand tables by capturing structural and numerical information. The third pre-trained model is the TAPAS-Row-Col-Rank model trained on the SQA dataset (Iyyer et al., 2017) with a question answering task; we only utilize its parameters for the bottom layers encoding the text and table sequence. Since this model is trained to generate answers by selecting table cells and predicting the aggregation function to be applied to those cells, it should be able to capture more complex numerical information.
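As an illustration of this fine-tuning setup, the sketch below uses the TAPAS implementation in the Hugging Face transformers library (an assumption on our part; the experiments in this paper were run with a different codebase). It loads a pre-trained checkpoint and attaches a fresh two-way classification head on the pooled [CLS] output, analogous to the single classification layer described above.

import pandas as pd
from transformers import TapasTokenizer, TapasForSequenceClassification

# Load a pre-trained TAPAS checkpoint and attach a fresh entailed/refuted head.
tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")
model = TapasForSequenceClassification.from_pretrained("google/tapas-base", num_labels=2)

# Tables are passed as string-valued DataFrames; the tokenizer produces the
# row/column/rank token type ids described above. Toy table in the spirit of Figure 1.
table = pd.DataFrame(
    {"Pick": ["1", "3"], "Player": ["Ralph Sampson", "Rodney McCray"]}
).astype(str)
inputs = tokenizer(
    table=table,
    queries=["Ralph Sampson was two picks ahead of Rodney McCray"],
    padding="max_length",
    return_tensors="pt",
)
logits = model(**inputs).logits   # shape [1, 2]; fine-tune with cross-entropy on TabFact labels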

Experiments
In this section, we first introduce the experimental settings in detail (§4.1). Then we present results comparing the two variants of the TAPAS model with other state-of-the-art models on the table-based fact verification task (§4.2). Further discussion of the models is provided in §4.3.

Experimental Setup
Dataset In the main experiment, we evaluate different models on TabFact, a large-scale dataset for table-based fact verification proposed by Chen et al. (2019). TabFact is generated by asking annotators to write entailed and refuted statements given a Wikipedia table.

Comparisons We compare the TAPAS-Row-Col and TAPAS-Row-Col-Rank models with the following state-of-the-art models: a. Latent Program Algorithm (LPA). This model (Chen et al., 2019) first uses trigger words to prune pre-defined APIs, including around 50 functions such as min and max. Candidate latent programs are then constructed using breadth-first search with memoization, and a discriminator is trained with weakly supervised labels to assign a confidence score to each program. Their best-performing variant ranks all the latent programs by the discriminator confidence score and makes its prediction by executing the top-ranked program. b. Table-BERT (Chen et al., 2019), which uses templates to linearize the table into "somewhat natural" sentences and fine-tunes BERT on statement-table sentence pairs. c. GNN-TabFact, which applies a numerically-aware graph neural network on top of Table-BERT to model numerical comparisons among cells in the same column. d. LogicalFactChecker (Zhong et al., 2020), which derives a program from the table and statement and combines graph-enhanced contextual representations with a program representation learned by a neural module network.

Evaluation Metric All models are evaluated for accuracy on classifying test statements as entailed or refuted by the corresponding evidence table.

Table 2 shows that, when TAPAS-Row-Col is fine-tuned from the original BERT model, shrinking the table to only the related columns significantly improves its accuracy, from 60.5% to 68.3%. However, when we fine-tune the TAPAS-Row-Col model pre-trained on Wikipedia tables, the difference is not as significant. This shows that filtering the table columns mainly helps when applying language models pre-trained on unstructured data to structured data. One plausible column-filtering heuristic is sketched below.
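For concreteness, the following sketch shows one plausible way to restrict a table to the columns related to a statement, by keeping columns whose header or cell strings share a token with the statement. This is only an illustrative heuristic under our own assumptions; the exact filtering rule used in the experiments may differ.

def filter_related_columns(statement, header, rows):
    """Keep only columns whose header or cell values share a token with the statement."""
    statement_tokens = set(statement.lower().split())
    keep = []
    for c, name in enumerate(header):
        column_text = " ".join([str(name)] + [str(row[c]) for row in rows]).lower()
        if statement_tokens & set(column_text.split()):
            keep.append(c)
    if not keep:                      # fall back to the full table if nothing matches
        keep = list(range(len(header)))
    new_header = [header[c] for c in keep]
    new_rows = [[row[c] for c in keep] for row in rows]
    return new_header, new_rows

# Example: only the "pick" and "player" columns overlap with the statement tokens.
print(filter_related_columns(
    "ralph sampson was pick number 1",
    ["pick", "player", "college"],
    [["1", "ralph sampson", "virginia"], ["3", "rodney mccray", "louisville"]],
))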

Main Results
What is the effect of pre-training? We fine-tuned TAPAS-Row-Col starting from two different pre-trained models: the original BERT model pre-trained on unstructured text and the TAPAS model pre-trained by Herzig et al. (2020) on Wikipedia tables together with sentences from the corresponding text paragraphs. Both pre-trained models are trained with the masked language modeling task. As we can see, pre-training on tables significantly improves the performance of TAPAS-Row-Col, from 68.3% to 73.9%. Fine-tuning TAPAS-Row-Col-Rank on the model pre-trained on the question answering task, where numerical reasoning skills are required to generate an answer for a given question, further improves accuracy from 74.5% to 76%. However, if we only fine-tune on the subset of table columns matching the statement, TAPAS-Row-Col-Rank does not benefit from the pre-trained question answering model. We conjecture that this is because the original model is pre-trained on full tables.

Does adding numerical comparisons help? Both GNN-TabFact and TAPAS-Row-Col-Rank model numerical comparison relations among cells in the same column. GNN-TabFact learns a representation for each table cell by iteratively passing messages about numerical relations, such as greater than or less than, in a graph neural network. The TAPAS-Row-Col-Rank model instead adds rank and inverse rank indices for cells in the same column as additional position embeddings for each token in those cells. GNN-TabFact significantly outperforms Table-BERT by adding a numerically-aware graph neural network on top of it. It also outperforms the LogicalFactChecker model, which first derives a program from the table and the statement and then learns representations for the programs via a program-guided neural module network. Similarly, the TAPAS-Row-Col-Rank model significantly outperforms TAPAS-Row-Col by introducing rank position embeddings for each numerical column. We also find that TAPAS-Row-Col-Rank outperforms GNN-TabFact without complicated graph inference. It is worth noting that even TAPAS-Row-Col, which does not utilize any numerical information, outperforms GNN-TabFact, the previous state of the art on TabFact. This again demonstrates the advantage of directly encoding table structure in TAPAS and the value of pre-training on structured data.
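To show how these rank indices enter the model, the following sketch (plain PyTorch, with invented layer names and sizes; not the actual TAPAS code, which also includes segment and inverse-rank embeddings) composes each token embedding as a sum of word, position, row, column, and rank embeddings, so that numerical order within a column is visible to every token of a cell.

import torch
import torch.nn as nn

class TableInputEmbedding(nn.Module):
    """Hypothetical input layer: sum word, position, and structural embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512,
                 max_rows=64, max_cols=32, max_rank=128):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_pos, hidden)
        self.row = nn.Embedding(max_rows, hidden)      # 0 = statement or header token
        self.col = nn.Embedding(max_cols, hidden)      # 0 = statement token
        self.rank = nn.Embedding(max_rank, hidden)     # 0 = non-numeric column or statement
        self.norm = nn.LayerNorm(hidden)

    def forward(self, input_ids, row_ids, col_ids, rank_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = (self.word(input_ids) + self.pos(positions)
             + self.row(row_ids) + self.col(col_ids) + self.rank(rank_ids))
        return self.norm(x)   # [batch, seq_len, hidden], fed to the Transformer encoder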

Further Experiments and Discussion
To analyze model performance, we perform further experiments on different training and test sets.
How does numerical information help? To examine how numerical comparison information helps to improve performance, we use heuristic rules to find the test statements that involve comparing table records, summing table columns, and counting table records. We first find all words ending in -est in the complex statements and remove those that are not superlatives. Then we extract all test samples from the complex set containing these words or the word most as the superlative test set. We construct a comparative test set in a similar way, by finding words ending in -er, removing non-comparative words, and adding the words more and less to the set; we further require that comparative samples contain the word than together with one of the comparative words. The sum test set is constructed by finding all samples containing total or sum, and the count test set is constructed by finding all samples containing all, every, none, only, or each. The final superlative and comparative test sets contain 1,701 and 1,366 instances, respectively; the sum and count test sets contain 344 and 1,710 instances, respectively. A sketch of this keyword-based filtering is shown below.
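As an illustration only (the exact exclusion word lists behind the published test-set sizes are not reproduced here), the sketch below implements this kind of keyword-based bucketing of statements into superlative, comparative, sum, and count subsets.

# Illustrative word filters; the actual exclusion lists used in the paper may differ.
NOT_SUPERLATIVE = {"west", "rest", "test", "guest", "contest"}
NOT_COMPARATIVE = {"other", "another", "number", "never", "over", "under"}

def bucket(statement):
    tokens = statement.lower().split()
    buckets = set()
    superlatives = [t for t in tokens
                    if (t.endswith("est") and t not in NOT_SUPERLATIVE) or t == "most"]
    comparatives = [t for t in tokens
                    if (t.endswith("er") and t not in NOT_COMPARATIVE) or t in {"more", "less"}]
    if superlatives:
        buckets.add("superlative")
    if comparatives and "than" in tokens:          # comparative words must co-occur with "than"
        buckets.add("comparative")
    if {"total", "sum"} & set(tokens):
        buckets.add("sum")
    if {"all", "every", "none", "only", "each"} & set(tokens):
        buckets.add("count")
    return buckets

print(bucket("ralph sampson scored more points than rodney mccray"))   # {'comparative'}
print(bucket("virginia had the highest number of picks"))              # {'superlative'}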
From ToTTo, using negative sampling, we derive a synthetic dataset for fact verification to compare with the TabFact data. We first remove all samples in the ToTTo data that also appear in TabFact. Since the cells related to the summarizing sentence are highlighted in ToTTo, we extract all sentences containing entities that exactly match the highlighted table cells and treat them as facts. Then, for each fact statement, we randomly choose an entity in it and replace it with a randomly chosen cell from the same column to generate a false statement; to ensure the false statement differs from the fact statement, we only choose cells whose values differ from the original. We end up with 75,292 training, 8,366 validation, and 5,242 test samples. We fine-tune the TAPAS-Row-Col-Rank model on this synthetic dataset. The results on this synthetic dataset and on TabFact are listed in Table 5. Training on the synthetic data achieves 76.4% accuracy on the simple TabFact statements, while training on TabFact achieves 79.2% accuracy on the synthetic data. This confirms that the model trained on TabFact is able to recognize cell-copying errors. Both models are better at classifying positive statements than negative statements, an effect that is more pronounced for the model trained on TabFact. A sketch of this negative-sampling procedure follows.
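The sketch below gives one possible implementation of this cell-replacement negative sampling; the helper name, arguments, and the prior entity-matching step are our own simplifications rather than the exact procedure used to build the dataset.

import random

def make_false_statement(fact, matched_entities, table_columns, rng=random.Random(0)):
    """Replace one matched entity in the fact with a different cell from the same column.

    matched_entities: list of (entity_string, column_name) pairs found in the fact.
    table_columns: dict mapping column_name -> list of cell strings in that column.
    Returns a corrupted statement, or None if no valid replacement exists.
    """
    rng.shuffle(matched_entities)
    for entity, column in matched_entities:
        candidates = [c for c in table_columns[column] if c != entity]  # must differ from the original
        if candidates:
            return fact.replace(entity, rng.choice(candidates), 1)
    return None

fact = "Ralph Sampson was drafted by the Houston Rockets."
print(make_false_statement(
    fact,
    matched_entities=[("Ralph Sampson", "Player")],
    table_columns={"Player": ["Ralph Sampson", "Rodney McCray", "Hakeem Olajuwon"]},
))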
We also generated another synthetic dataset with uniformly distributed character-editing errors on digits to see whether models trained on TabFact could capture this kind of error. We first extract all the test sentences containing numbers that exactly match the highlighted table cells. Then we randomly choose to insert, delete, or substitute one random digit in the first position (to make the task easier) of a randomly chosen matching number in the sentence (to ensure there is a clue in the table). A total of 1,126 synthetic refuted statements are generated for testing. We check whether the TAPAS-Row-Col-Rank model trained on TabFact can recognize this type of error. Only 43.0% of these synthetic statements are classified as refuted by the model, which is much lower than the 71.6% accuracy on the refuted statements generated above by cell copying. This reveals that models trained on TabFact are less sensitive to purely numerical errors. Table 6 shows examples of synthetic false statements that the model trained on TabFact failed to recognize, for instance "As a sophomore, Peters averaged 16.8 points and 6.7 rebounds per game" edited to "... and 66.7 rebounds per game", and "Ralph Herseth (1909-1969) was the 21st governor of South Dakota from January 6, 1959 to January 3, 1961" edited to "... the 11st governor ...". A sketch of the digit-editing procedure appears below.
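For illustration, the following sketch applies one such first-digit edit to a number string; the sampling distribution over edit types and over numbers in the sentence is our own simplification.

import random

def edit_first_digit(number_text, rng=random.Random(1)):
    """Insert, delete, or substitute a digit at the first position of a number string."""
    op = rng.choice(["insert", "delete", "substitute"])
    digits = "0123456789"
    if op == "insert":
        return rng.choice(digits) + number_text                      # e.g. "6.7"  -> "66.7"
    if op == "delete" and len(number_text) > 1:
        return number_text[1:]                                       # e.g. "21st" -> "1st"
    replacement = rng.choice([d for d in digits if d != number_text[0]])
    return replacement + number_text[1:]                             # e.g. "21st" -> "11st"

print(edit_first_digit("6.7"), edit_first_digit("21st"))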

Conclusions
We adapted the Table Parsing (TAPAS) model, with its efficient encoding of table content, structure, and numerical comparison information, to the task of table-based fact verification. We compared two variant models: TAPAS-Row-Col, which models table contents and structure, and TAPAS-Row-Col-Rank, which adds numerical comparison information. Experiments showed that, with pre-training on tabular data, both TAPAS-Row-Col and TAPAS-Row-Col-Rank outperform the state-of-the-art numerically-aware graph neural network model. We also examined how ranking information helps improve TAPAS's performance on superlative, comparative, sum, and count statements. Models trained on different datasets were compared to study how statement complexity affects model performance. We also constructed two synthetic datasets to examine the generalization of these models, finding that models trained on TabFact perform well on errors arising from simple cell replacement but not on digit-editing errors. In future work, we aim to extend TAPAS to explicitly model more numerical operations for fact verification and to correct false statements automatically given tabular evidence.