Joint Verification and Reranking for Open Fact Checking Over Tables

Structured information is an important knowledge source for automatic verification of factual claims. Nevertheless, the majority of existing research into this task has focused on textual data, and the few recent inquiries into structured data have been for the closed-domain setting where appropriate evidence for each claim is assumed to have already been retrieved. In this paper, we investigate verification over structured data in the open-domain setting, introducing a joint reranking-and-verification model which fuses evidence documents in the verification component. Our open-domain model achieves performance comparable to the closed-domain state-of-the-art on the TabFact dataset, and demonstrates performance gains from the inclusion of multiple tables as well as a significant improvement over a heuristic retrieval baseline.


Introduction
Verifying whether a given fact coheres with a trusted body of knowledge is a fundamental problem in NLP, with important applications to automated fact checking (Vlachos and Riedel, 2014) and other tasks in computational journalism (Cohen et al., 2011;Flew et al., 2012). Despite extensive investigation of the problem under different conditions including entailment and natural language inference (Dagan et al., 2005;Bowman et al., 2015) as well as claim verification (Vlachos and Riedel, 2014;Alhindi et al., 2018;Thorne and Vlachos, 2018), relatively little attention has been devoted to the setting where the trusted body of evidence is structured in nature -that is, where it consists of tabular or graph-structured data.
Recently, two datasets were introduced for claim verification over tables (Chen et al., 2020; Gupta et al., 2020). In both datasets, claims can be verified given a single associated table. While highly useful for the development of models, this closed setting is not reflective of real-world fact checking tasks, where it is usually not known which table to consult for evidence. Realistic systems must first retrieve evidence from a large data source; that is, they must operate in an open setting.

Figure 1: Example query to be evaluated against two retrieved tables. Named entities represent a strong baseline for retrieval, but ultimately a more complex model is required to distinguish highly similar tables.
Here, we investigate fact verification over tables in the open setting. We take inspiration from similar work on unstructured data (Chen et al., 2017; Nie et al., 2019; Karpukhin et al., 2020; Lewis et al., 2020), proposing a two-step model which combines ad-hoc retrieval with a neural reader. Drawing on preliminary work in open question answering over tables (Sun et al., 2016), we perform retrieval based on simple heuristic modeling of individual table cells. We combine this retriever with a RoBERTa-based (Liu et al., 2019) joint reranking-and-verification model, performing fusion of evidence documents in the verification component. This corresponds to the approach suggested for question answering by e.g. Izacard and Grave (2020). We evaluate our models using the recently introduced TabFact dataset (Chen et al., 2020). While initially developed for the closed domain, the majority of claims are sufficiently context-independent that they can be understood without knowing which table they were constructed with reference to. As such, the dataset is suitable for the open domain as well. Our models represent a first step into the open domain, achieving open-domain performance exceeding the previous closed-domain state of the art, with the exception of Eisenschlos et al. (2020), which includes pretraining on additional synthetic data. We demonstrate significant gains from including multiple tables, with the gains increasing as more tables are used. We furthermore present results using a more realistic setting where tables are retrieved not just from the 16,573 TabFact tables, but from the full Wikipedia dump. Our contributions can be summarized as follows:

Open-Domain Table Fact Verification
Formally, the open table fact verification problem can be described as follows. Given a claim q and a collection of tables T, the task is to determine whether q is true or false. As such, we approach the task by modeling a binary verdict variable v as p(v|q, T). This is in contrast to the closed setting, where a single table t_q ∈ T is given, and the task is to model p(v|q, t_q). Since there are large available datasets for the closed setting (Chen et al., 2020; Gupta et al., 2020), it is reasonable to exploit t_q during training; however, at test time, this information may not be available. We adapt to our setting the two-step methodology often used in the open domain for unstructured data (Chen et al., 2017; Nie et al., 2019; Karpukhin et al., 2020; Lewis et al., 2020). Namely, given a claim query q, we retrieve a set of evidence tables D_q ⊂ T (Section 3), and subsequently model p(v|q, D_q) in place of p(v|q, T) (Section 4).

Entity-based Retrieval
We first design a strategy for retrieving an appropriate subset of evidence tables for a given query. For question answering over tables, Sun et al. (2016) demonstrated strong performance on retrieving relevant tables using entity linking information, following the intuition that many table cells contain entities. We take inspiration from these results. In their setting, claim entities are linked to Freebase entities, and string matching on the alias list is used to map entities to cells. To avoid reliance on a knowledge graph, we instead use only the textual string from the claim to represent entities, and perform approximate matching through dot products of bi- and tri-gram TF-IDF vectors. We pre-compute bi- and tri-gram TF-IDF vectors z(c_t^1), ..., z(c_t^m) for every table t ∈ T with cells c_t^1, ..., c_t^m. Then, we identify the named entities e_q^1, ..., e_q^n within the query q. For our experiments, we use the named entity spans for TabFact provided by Chen et al. (2020) as part of their LPA model.1 We compute bi- and tri-gram TF-IDF vectors z(e_q^1), ..., z(e_q^n) for the surface forms of those entities. To retrieve D_q given q, we then score every t ∈ T. Since we are approximating entity linking between claim entities and cells, we let the score between an entity and a table be the best match between that entity and any cell in the table. That is:

score(q, t) = Σ_{i=1}^{n} max_{j ∈ {1,...,m}} z(e_q^i) · z(c_t^j)

In other words, we compute for every entity the best match in the table, and score the table as the sum over these best matches. To construct the set of evidence tables D_q, we then retrieve the top-k highest-scoring tables. Our choice of bi- and tri-gram TF-IDF as the retrieval strategy was determined empirically - see Section 5.1 and Table 1 for experimental comparisons.
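The entity-to-cell scoring described above can be sketched as follows. This is a minimal illustration under assumed word-level bi-/tri-gram tokenisation and a simple IDF formula (the exact TF-IDF variant and tokeniser are not specified in this section), not the authors' implementation:

```python
import math
from collections import Counter

def ngrams(tokens, ns=(2, 3)):
    """Word-level bi- and tri-grams (tokenisation is an assumption)."""
    out = []
    for n in ns:
        out += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

def build_idf(cell_texts):
    """Document frequencies computed over individual table cells."""
    df = Counter()
    for text in cell_texts:
        df.update(set(ngrams(text.split())))
    return {g: math.log(len(cell_texts) / (1 + c)) for g, c in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector as a gram -> weight dict."""
    tf = Counter(ngrams(text.split()))
    return {g: c * idf.get(g, 0.0) for g, c in tf.items()}

def dot(u, v):
    return sum(w * v.get(g, 0.0) for g, w in u.items())

def score_table(entity_vecs, cell_vecs):
    # score(q, t): for each claim entity, take its best-matching cell,
    # then sum the best matches over all entities.
    return sum(max((dot(e, c) for c in cell_vecs), default=0.0)
               for e in entity_vecs)
```

A table sharing n-grams with a claim entity scores above one that shares none, which is the behaviour the retriever relies on.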

Neural Verification
To model p(v|q, D_q), we employ a RoBERTa-based (Liu et al., 2019) late fusion strategy (see Figure 2 for a diagram of our model). Given a query q with a ranked list of k retrieved tables D_q = (d_q^1, ..., d_q^k), we begin by linearising each table. Our linearisation scheme follows Chen et al. (2020). We first perform sub-table selection by excluding columns not linked to entities in the query. Here, we reuse the entity linking obtained during the retrieval step (see Section 3), and retain only the three columns in which cells received the highest retrieval scores. We linearise each row separately, encoding entries and table headers. Suppose r is a row with cell entries c_1, c_2, ..., c_m in a table, where the corresponding column headers are h_1, h_2, ..., h_m. Row number r is then mapped to "row r is : h_1 is c_1 ; h_2 is c_2 ; ... ; h_m is c_m .". We construct a final linearisation L_{q,t} for each query-table pair (q, t) by prepending the query to the filtered table linearisation. We then encode each L_{q,t} with RoBERTa, and obtain a contextualised embedding f(d_q^k) ∈ R^n for every table by using the final-layer embedding of the CLS token. We construct the sequence of embeddings f(d_q^1), ..., f(d_q^k) for all k tables.
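A minimal sketch of the row-wise linearisation, assuming the Table-BERT-style template ("row r is : h is c ; ...") and leaving the sub-table column selection to an upstream step:

```python
def linearise(query, headers, rows):
    """Prepend the query to a row-wise linearisation of the (filtered) table.
    Column filtering (Section 4) is assumed to have been applied to `headers`
    and `rows` already."""
    parts = []
    for r, row in enumerate(rows, start=1):
        cells = " ; ".join(f"{h} is {c}" for h, c in zip(headers, row))
        parts.append(f"row {r} is : {cells} .")
    return query + " " + " ".join(parts)
```

The resulting string is what gets encoded by RoBERTa, with the CLS embedding serving as the table representation.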
When the model attempts to judge whether to rely on a given table for verification, other highly-scored tables represent useful contextual information (e.g., in the example in Figure 1, newspapers belonging to the same owner may be likely to also share political leanings). Nevertheless, each table embedding f(d_q^k) is functionally independent of the embeddings of the other tables. As such, contextual clues from other tables cannot be taken into account. To remedy this, we introduce a cross-attention layer between all tables corresponding to the same query. We collect the embeddings f(d_q^k) of each table into a tensor F(D_q). We then apply a single multi-head self-attention transformation as defined by Vaswani et al. (2017) to this tensor, and concatenate the result. That is, we compute an attention score for head h from table i to table j with query q as:

a_{i,j}^h = σ_j( (W_Q f(d_q^i))^T (W_K f(d_q^j)) / √d )

where σ is the softmax function (normalising over j), and W_Q and W_K represent linear transformations to queries and keys, respectively. We then compute an attention vector for that head as:

v_i^h = Σ_j a_{i,j}^h W_V f(d_q^j)

where W_V is the corresponding linear transformation to values, and finally construct contextualized table representations through concatenation over the H heads as:

f*(d_q^i) = [v_i^1 ; v_i^2 ; ... ; v_i^H]

We subsequently use F*(D_q), i.e. the tensor containing f*(d_q^1), ..., f*(d_q^k), for downstream predictions. We note that our approach can be viewed as an extension of the Table-BERT algorithm introduced by Chen et al. (2020) to the multi-table setting, using an attention function to fuse together the information from different tables.
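The cross-attention fusion step can be illustrated with a small NumPy sketch; the head count and per-head projection shapes here are arbitrary choices for illustration, not the authors' configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def table_cross_attention(F, Wq, Wk, Wv):
    """F: (k, n) matrix of table embeddings f(d_q^1), ..., f(d_q^k).
    Wq, Wk, Wv: lists of per-head (n, d) projection matrices.
    Returns the concatenated per-head attention outputs f*(d_q^i)."""
    heads = []
    for q_proj, k_proj, v_proj in zip(Wq, Wk, Wv):
        Q, K, V = F @ q_proj, F @ k_proj, F @ v_proj
        # (k, k) attention from table i (rows) to table j (columns)
        A = softmax(Q @ K.T / np.sqrt(K.shape[1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1)
```

Each output row mixes information from all k retrieved tables, which is exactly what lets contextual clues from one table inform the reading of another.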

Training & Testing
A closed-domain dataset provides, for each query, a table with the appropriate information: namely, the table against which the claim is checked in the closed setting. Although this information is not available at test time, we can construct a training regime that exploits it to improve model performance. We experiment with two different strategies: jointly modeling reranking of tables along with verification of the claim, and modeling for each table a ternary choice between indicating truth, falsehood, or giving no relevant information. Later, we demonstrate how the former leads to increased performance on verification, while the latter gives access to a strong predictor for cases where no appropriate table has been retrieved.
Joint reranking and verification For the joint reranking and verification approach, we assume that a best table for answering each query is given and can be used to learn a ranking function. We model this as selecting the right table from D_q, e.g., through a categorical variable s that indicates which table should be selected. We then learn a joint probability of s and the truth value v of the claim over the tables for a given query. Assuming that s and v are independent, p(s, v|q, D_q) is also a categorical distribution with one correct outcome that can be optimized for (that is, one correct pair of table and truth value). As such, we let:

p(s = i, v = j | q, D_q) = σ( [W f*(d_q^i)]_j )

where W : R^{2n} → R^2 is an MLP and σ is the softmax function, normalising over all 2k pairs of tables and truth values. At train time, we obtain one cross-entropy term corresponding to p(s, v|q, D_q) per query. At test time, we marginalize over s to obtain a final truth value:

p(v|q, D_q) = Σ_i p(s = i, v|q, D_q)

This formulation has the additional benefit of also allowing us to make a prediction on which table matches the query. We can do so by marginalizing over v:

p(s|q, D_q) = Σ_v p(s, v|q, D_q)    (7)

With this loss, we train the model by substituting for D_q a set D_q* wherein the gold table is guaranteed to appear. We ensure this by replacing the lowest-scored retrieved table in D_q with the gold table whenever it has not been retrieved.
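The joint objective and both marginalisations can be sketched as follows, assuming per-query logits of shape (k, 2) for the k (table, verdict) pairs; the verdict label order is an assumption for illustration:

```python
import numpy as np

def joint_probs(logits):
    """logits: (k, 2) scores [W f*(d_q^i)]_j for each (table s=i, verdict v=j).
    A single softmax over all 2k outcomes gives p(s, v | q, D_q)."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def predict_verdict(logits):
    """Test time: marginalise over tables, p(v|q, D_q) = sum_i p(s=i, v|q, D_q)."""
    return joint_probs(logits).sum(axis=0)

def rerank_tables(logits):
    """Equation 7: marginalise over verdicts, p(s|q, D_q) = sum_v p(s, v|q, D_q)."""
    return joint_probs(logits).sum(axis=1)

def nll(logits, gold_table, gold_verdict):
    """Train time: one cross-entropy term on the gold (table, verdict) pair."""
    return -np.log(joint_probs(logits)[gold_table, gold_verdict])
```

The same set of logits thus yields the verification verdict, the reranking distribution, and the training loss.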
Ternary verification At test time, there may be cases where a table refuting or verifying the fact is not contained in D_q. For some applications, it can be useful to identify these cases. We therefore design an alternative variant of our system better suited to this scenario. Intuitively, each table can represent three outcomes - the query is true, the query is false, or the table is irrelevant. We model this through a ternary variable I_t such that, for table t:

p(I_t | q, t) = σ( W f*(t) )

where W : R^{2n} → R^3 is an MLP and σ is the softmax function. During training, we assign true or false to the gold table depending on the truth of the query, and irrelevant to every other table.
We then use the mean cross-entropy over the tables associated with each query as the loss for each example. At test time, we compute the truth value v of each query as:

v = argmax_{v' ∈ {true, false}} max_{t ∈ D_q} p(I_t = v' | q, t)
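Under an assumed label ordering {0: true, 1: false, 2: irrelevant}, the ternary training loss and one plausible test-time decision rule (the exact rule is underspecified in this extraction) can be sketched as:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ternary_loss(logits, gold_table, claim_is_true):
    """Mean cross-entropy over the k retrieved tables: the gold table is
    labelled true/false, every other table is labelled irrelevant."""
    p = softmax(logits)                      # (k, 3) per-table distributions
    labels = np.full(len(logits), 2)         # 2 = irrelevant
    labels[gold_table] = 0 if claim_is_true else 1
    return -np.log(p[np.arange(len(logits)), labels]).mean()

def ternary_predict(logits):
    """The verdict whose strongest per-table probability is highest wins."""
    p = softmax(logits)
    return bool(p[:, 0].max() > p[:, 1].max())
```

Because each table carries its own irrelevance probability, this variant also supports the insufficient-information detection discussed in Section 5.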

Experiments
We apply our model to the TabFact dataset (Chen et al., 2020), which consists of 92,283 training, 12,792 validation and 12,792 test queries over 16,573 tables. The task is binary classification of claims as true or false, with an even proportion of the two classes in each split.
To benchmark our open-domain models and construct performance bounds, we begin by evaluating in the closed domain. As an upper bound, we compare against the performance of the closed-domain system scored using a single table retrieved through an oracle. As a lower bound, we again use the closed-domain system, but with the highest-ranked table according to our TF-IDF retriever. The evaluation metric is simply prediction accuracy.

Retrieval
We choose bi- and tri-gram TF-IDF as the retrieval strategy empirically. To assess the comparative performance of this choice, we report in Table 1 the retrieval scores obtained through our strategy on the TabFact test set. We compare against several alternative strategies: bi- and tri-gram TF-IDF vectors for all words in the query (rather than just the entities), word-level TF-IDF vectors for entities, and entity-level exact matching. Our bi- and tri-gram TF-IDF strategy yields by far the strongest performance. We furthermore find that excluding unigrams from the TF-IDF vectors slightly increases performance.

Verification
In Table 2, we compare our best-performing models to the closed-setting system from Chen et al. (2020), as well as to several recent models from the literature (e.g., Eisenschlos et al., 2020). We include results with both losses discussed in Section 4, using varying numbers of tables.
With an accuracy of 75.1%, we obtain the best open-domain results with our model using the joint reranking-and-verification loss and five tables. We see performance improvements when increasing the number of tables, both from 1 to 3 and from 3 to 5. In the closed domain, the 77.6% accuracy our model achieves is a significant improvement over the 74.4% reached by the strongest comparable baseline. This may be due to our use of RoBERTa, which has previously been found to perform well on linearised tables (Gupta et al., 2020).
Relying purely on TF-IDF for retrieval - that is, using our system with only one retrieved table - yields a performance of 73.2%. This is a surprisingly small decrease compared to the closed domain, given that an incorrect table is provided in approximately a third of all cases (see Table 1). We suspect that many cases for which the retriever fails are also cases for which the closed-domain model fails. To make sure we are not seeing the effect of false negatives (e.g., tables which are not the gold table, but which nevertheless contain the information needed to verify the claim), we run the model in a setting where one retrieved table is used, but the gold table is removed from the retrieval results; here, the model achieves an accuracy of only 56.2%. We furthermore test a system relying on a random table rather than a retrieved table; with a performance drop to 53.1%, we find that the information in the retrieved table is indeed crucial to obtaining high performance (rather than the performance being purely a consequence of, say, the RoBERTa weights).
To understand how our model derives improvement from the addition of more tables, we compute in Table 3 the performance of our reranking-and-verification model when TF-IDF returns the correct table at rank 1, rank 2-3, or rank 4-5. Immediately, we notice a much stronger improvement from using multiple tables when TF-IDF fails to correctly identify the gold table. This is natural, as those are exactly the cases where our model (as opposed to the baseline) has access to the appropriate information to verify or refute the claim.
Interestingly, using three tables improves on using one table even when the gold table is not included among the top three (from 53.9% to 58.2%), and using five tables improves on using three tables also when the gold table is included among the top three (from 66.7% to 73.1%). Manual inspection reveals that our model in some cases relies on correlations between tables - if a sports team loses games in three tables, then that may give a higher probability of that team also losing in an unretrieved, hypothetical fourth table. To test this, we apply the model in a setting where we retrieve the top five tables excluding the gold table, and a setting where we use five random tables. Using highly scored (but wrong) tables, we achieve a performance of 59.4%, a significant improvement on the 53.1% we achieve using random tables. This supports our hypothesis that other good tables can provide useful background context for verification. It should be noted that such inferences, while increasing model performance, may also increase the degree to which the model exhibits biases. Depending on the application, this may therefore not be a desirable basis for verification. Returning to the example in Figure 1, inferring ownership on the basis of political affiliation when no other information is available may increase accuracy on average, but it can also lead to erroneous or biased decisions (indeed, for the claim in the example, the prediction would be wrong).

Ablation Tests
Our best-performing model from Table 2 relies on two innovations: the cross-attention function which contextualizes retrieved tables in relation to each other, and the joint reranking-and-verification loss. In Table 4, we evaluate the model with each of these removed. Leaving out the attention function is simple - we use f(d_q^k) for each table directly for predictions. We model performance without the reranking component of our loss function by assuming a uniform distribution over the tables.
As can be seen, the combination of both is strictly necessary to obtain strong performance - indeed, without our joint objective, the model performs worse than simply applying the baseline model to the top table returned by TF-IDF as in Table 2. The ability of the model to express the relative relatedness of tables to the query is crucial.

Figure 3: Precision-recall curves for predicting whether the gold table was retrieved, using our ternary loss and our joint loss; we also include a most frequent class baseline.
We include further investigation of the role our cross-attention mechanism plays in Appendix E.

Predicting Insufficient Information
In realistic settings, some claims will not be directly answerable from any retrieved table. In such cases, it can be valuable to explicitly inform the user: giving false verifications or refutations when sufficient information is not available is misleading, and can decrease user trust. To model a scenario where the lack of relevant information must be detected, we create a classification task wherein the model must predict, for every example, whether the gold table is among the k documents in D_q. Using the ternary loss, our model directly gives the probability of each table containing appropriate information as (1 - p(I_t = irrelevant|q, t)). We can estimate the suitability of the best retrieved table for verifying the claim as max_t (1 - p(I_t = irrelevant|q, t)), and apply a threshold τ_1 to classify D_q as suitable or unsuitable. For the joint loss, a more indirect approach is necessary. Intuitively, if our model is too uncertain about which table answers the query, there is a high likelihood that no suitable table has been retrieved. This corresponds to the entropy H(s|q, D_q) of the reranking component, after marginalizing over the truth value of the claim, exceeding some threshold τ_2.
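Both detection strategies amount to simple threshold rules; the function names and array-shape conventions below are illustrative assumptions:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a probability vector (clipped for numerical safety)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def suitable_ternary(p_irrelevant, tau1):
    """Ternary loss: D_q is suitable if the best table is likely relevant,
    i.e. max_t (1 - p(I_t = irrelevant | q, t)) exceeds tau1."""
    return bool((1.0 - np.min(p_irrelevant)) > tau1)

def suitable_joint(p_sv, tau2):
    """Joint loss: marginalise out the verdict v (Equation 7) and flag D_q as
    unsuitable when the reranker entropy H(s|q, D_q) exceeds tau2."""
    p_s = p_sv.sum(axis=1)   # p(s | q, D_q), shape (k,)
    return bool(entropy(p_s) <= tau2)
```

Sweeping tau1 and tau2 over their ranges yields the precision-recall curves compared in Figure 3.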
We compare these strategies in Figure 3, obtaining precision-recall curves by varying τ_1 and τ_2. We find that while both approaches outperform a most frequent class baseline by a significant margin, the ternary loss performs better than the joint loss. As such, the choice between the two losses represents a tradeoff between raw performance (see Tables 2 and 5) and the ability to identify missing or incomplete information.

Wikipedia-scale Table Verification
In our experiments so far, we have relied on the 16,573 TabFact tables as the knowledge source. The tables selected for TabFact were taken from WikiTables (Bhagavatula et al., 2013), and filtered so as to exclude "overly complicated and huge tables" (Chen et al., 2020). Moving beyond the scope of that dataset, a fully open fact verification system should be able to verify claims over even larger collections of tables - for example, the full set of tables available on Wikipedia. To make a preliminary exploration of that larger-scale setting, we include in Table 5 the performance of our approach evaluated using roughly 3 million tables automatically extracted from Wikipedia.

Table 5: Verification accuracy using roughly 3 million Wikipedia tables.

Model                     Accuracy
RoBERTa only              52.1
Ours (1 table)            53.6
Ternary loss, 3 tables    55.8
Ternary loss, 5 tables    57.5
Joint loss, 3 tables      56.1
Joint loss, 5 tables      58.1

As can be seen, our approach improves on the naive strategy of using a single table and a closed-domain verification component in this more complex setting as well. To verify that the inference happens on the basis of the retrieved tables and not simply the RoBERTa weights, we also include the performance of a model which simply uses classification on top of a RoBERTa encoding of the claim. As in our previous experiments, the joint-loss model with five retrieved tables performs the strongest. We note that it is unclear whether the performance we observe here originates from correlations obtained through background information (as we see in Section 5.2 when the retriever fails to find the appropriate table), or from verification against a single entirely appropriate table happening at a lower rate than when using TabFact.

Related Work
Semantic querying against large collections of tables has previously been studied for question answering. Sun et al. (2016) used string matching between aliases of linked entities to search millions of tables crawled from the Web, with retrieved table cells providing evidence for a question answering task. Jauhar et al. (2016) similarly used tables as a structured knowledge source for multiple-choice question answering. Neural modeling of tables has been the subject of several recent papers. Aside from the original BERT-based model of Chen et al. (2020), the closest to our work are papers introducing pretrained BERT-based encoders for tables, demonstrated to yield strong improvements on several semantic parsing tasks. Chen et al. (2019) introduced a model to automatically predict and compare column headers for tables in order to find semantically synonymous schema attributes. Similarly, Zhang and Balog (2019) introduced an autoencoder for predicting table relatedness.
Closed-domain semantic parsing over tables has been studied extensively in the context of question answering (e.g., Pasupat and Liang (2015); Khashabi et al. (2016); Yu et al. (2018)). A logic-based fact verification system was subsequently introduced to improve on the model presented in the initial TabFact paper (Chen et al., 2020). Further work has developed BERT-based models for various table semantic tasks, extending BERT with additional position embeddings denoting columns and rows.
Open-domain fact verification and question answering over unstructured, textual data has been studied in a series of recent papers. Early work resulted in several highly sophisticated full pipeline systems (Brill et al., 2002; Ferrucci et al., 2010; Sun et al., 2015). These provided inspiration for the influential DrQA model (Chen et al., 2017), which, like ours, relies on a TF-IDF-based heuristic retrieval model and a complex reading model. Recent work (Karpukhin et al., 2020; Lewis et al., 2020) has built on this approach, developing learned dense retrieval models with dot-product indexing (Johnson et al., 2017), and increasingly advanced pretrained transformer models for reading. The development of similarly fast, reliable and learnable indexing techniques for tables as well as text is an important direction for future work.
Concurrently with our work, Chen et al. (2020a) have introduced a BERT-based model to perform question answering over open collections of data including tables. Like ours, their model consists of separate retriever and reader steps. Their best-performing reader employs a long-range sparse attention transformer (Ainslie et al., 2020) to jointly summarize all retrieved data. As in our case, their model demonstrates significant improvements from using multiple retrieved tables.

Conclusion
We have introduced a novel model for fact verification over large collections of tables, along with two strategies for exploiting closed-domain datasets to increase performance. Our approach performs on par with the current closed-domain state of the art, with larger gains the more tables we include. When using an oracle to retrieve a reference table, our approach also represents a new closed-domain state of the art. Finally, we have made an initial foray into Wikipedia-scale open-domain table fact verification, demonstrating improvements from multiple tables also when using the full set of Wikipedia tables as the knowledge source. Our results indicate that the use of multiple tables can provide contextual clues to the model even when those tables do not explicitly verify or refute the claim, because they can provide evidence for the probability of the claim. This is a double-edged sword, as reliance on such clues can increase performance while also inducing biased claims of truthfulness. Care will be needed in future work to disentangle the positive and negative aspects of this phenomenon.

B Training Details

We train the model using Adam (Kingma and Ba, 2015) with a learning rate of 5e-6. We use a linear learning rate schedule, warming up over the first 30000 batches. We use a batch size of 32. Training was done on 8 NVIDIA Tesla V100 Volta GPUs (with 32GB of memory), and completed in approximately 36 hours.
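The learning-rate schedule described in the training details can be sketched as below; the total-step horizon for the decay phase is an assumption, since only the warmup length and peak learning rate are given:

```python
def lr_at(step, base_lr=5e-6, warmup=30000, total=200000):
    """Linear warmup to base_lr over `warmup` batches, then linear decay.
    `total` (the decay horizon) is an assumed value for illustration."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))
```

Such a schedule is commonly paired with Adam when fine-tuning RoBERTa-scale models.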

C Retrieval accuracy for TabFact splits
The TabFact dataset comes with several different data splits. We include here the performance of our retrieval component for each split:

D Reranking Performance
In Section 4, we introduced our model as a joint system for fact verification and evidence reranking. A benefit of our formulation is that we can evaluate the model's reranking ability directly by marginalizing over the truth value of the claim, following Equation 7. We report the results in Table 8.

Table 8: Ranking performance on the TabFact validation set, using either our TF-IDF retriever alone or reranking with our model. We test a version of our model using only a reranking loss, as well as the joint-loss model with and without attention.
As can be seen, our joint loss provides a slight performance improvement when the attention component is included. Interestingly, the joint-loss model performs better than a system trained purely for reranking -this highlights the complementary nature of the reranking and verification tasks.

E The Role of Attention
An interesting question is the role attention plays in our model. As can be seen from Tables 2 and 8, our cross-attention module is necessary to achieve high performance - without it, the model struggles to identify which table should be used for verification. To investigate the function of attention, we plot in Figure 4 the strength of the cross-attention between each pair of tables for our five-table model. We produce separate plots for the two attention heads, as well as for each of the splits used in Table 3, representing the parts of the dataset where our TF-IDF retriever assigns the gold table rank 1, 2-3, or 4-5, respectively.
For both attention heads, the attention function has clearly distinct behaviour when the gold table is retrieved as top 1; the degree to which that table attends to itself is much greater. We suspect that this is because of "easy" cases, where the attention function is used to separate a clearly identifiable "appropriate" table from the other tables. In harder cases, the model uses the attention focus to compare information across tables. To test this, we run the model in a setting where four random tables are used along with the gold table. In that setting, the division is even clearer. For the gold table, respectively 86 and 82 percent of the attention for the two heads is on average focused on itself; for the four random tables, the attention is evenly distributed over all tables except the gold table.
Distinguishing the two heads, we generally see the first head exhibiting a pattern where each table assigns the majority of its attention to itself - especially when that table is the gold table. The second head encodes a more even spread over the retrieved tables, perhaps representing general context rather than an attempt to identify the gold table.