Exploring Decomposition for Table-based Fact Verification

Fact verification based on structured data is challenging as it requires models to understand both natural language and symbolic operations performed over tables. Although pre-trained language models have demonstrated a strong capability in verifying simple statements, they struggle with complex statements that involve multiple operations. In this paper, we improve fact verification by decomposing complex statements into simpler subproblems. Leveraging the programs synthesized by a weakly supervised semantic parser, we propose a program-guided approach to constructing a pseudo dataset for decomposition model training. The subproblems, together with their predicted answers, serve as intermediate evidence to enhance our fact verification model. Experiments show that our proposed approach achieves a new state-of-the-art accuracy of 82.7% on the TabFact benchmark.


Introduction
Fact verification aims to validate whether a statement is entailed or refuted by given evidence. It has become crucial to many applications such as detecting fake news and rumors (Rashkin et al., 2017; Thorne et al., 2018; Goodrich et al., 2019; Vaibhav et al., 2019; Kryscinski et al., 2020). While existing research mainly focuses on verification based on unstructured text (Hanselowski et al., 2018; Yoneda et al., 2018; Nie et al., 2019), a recent trend is to explore structured data as evidence, which is ubiquitous in our daily life.
Verification performed with structured data presents research challenges of fundamental interest, as it involves both informal inference based on language understanding and symbolic operations such as mathematical operations (e.g., count and max). While all statements share the same set of operations, complex statements, which involve multiple operations, are more challenging than simple statements. Pre-trained models such as BERT (Devlin et al., 2019) have presented superior performance on verifying simple statements while still struggling with complex ones: a performance gap exists between the simple and complex tracks. In this paper, we propose to decompose complex statements into simpler subproblems to improve table-based fact verification, as shown in a simplified example in Figure 1. To avoid manually annotating gold decompositions, we design a program-guided pipeline to collect pseudo decompositions for training generation models, distinguishing four major decomposition types and designing templates accordingly. The programs we use are parsed from statements with a weakly supervised parser trained with signals from the final verification labels. Figure 1 shows a statement-program example. We adapt table-based natural language understanding systems to solve the decomposed subproblems. After obtaining the answers to the subproblems, we combine them in a pairwise manner as intermediate evidence to support the final prediction.
We perform experiments on the recently proposed benchmark TABFACT (Chen et al., 2020) and achieve a new state-of-the-art performance, an 82.7% accuracy. We further conduct studies to provide details on how the proposed models work.

Task Formulation and Notations
Given an evidence table T and a statement S, we aim to predict whether T entails or refutes S, denoted by y ∈ {1, 0}. For each statement S, the executable program derived from a semantic parser is denoted as z. An example program is given in Figure 1. Each program $z = \{op_i\}_{i=1}^{M}$ consists of multiple symbolic operations $op_i$, and each operation contains an operator (e.g., max) and arguments (e.g., all_rows and attendance). A complex statement S can be decomposed into a sequence of subproblems D, whose answers serve as the intermediate evidence E; given this evidence, our model maximizes the objective $\log p_\theta(y \mid T, S, E)$.
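For concreteness, here is a minimal sketch of how such a program and its operations could be represented in code; the class and field names are illustrative assumptions, not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Operation:
    operator: str         # e.g., "max", "eq", "count"
    arguments: List[str]  # e.g., ["all_rows", "attendance"]

# A program z is a sequence of operations; for a statement like
# "the highest attendance was at firhill", a parsed program might be:
z = [
    Operation(operator="max", arguments=["all_rows", "attendance"]),
    Operation(operator="eq", arguments=["max_result", "firhill"]),  # hypothetical
]
```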

Statement Decomposition
Constructing a high-quality dataset is key to the decomposition model training. Since semantic parsers can map statements into executable programs that not only capture the semantics but also reveal the compositional structures of the statements, we propose a program-guided pipeline to construct a pseudo decomposition dataset.

Constructing Pseudo Decompositions
Program Acquisition. Following Chen et al. (2020), we use the latent program algorithm (LPA) to parse each statement S into a set of candidate programs. To select the most semantically consistent program z* among all candidates and mitigate the impact of spurious programs, we optimize the program selection model with a margin loss, which is detailed in Appendix A.1.
By further removing programs that are label-inconsistent or cannot be split into two isolated sub-programs from the root operator, we obtain the remaining (T, S, z) triples as the source of data construction (these triples do not involve any tables or statements in the dev/test set of the dataset used in this paper).
Decomposition Templates. Programs are formal, unambiguous meaning representations of the corresponding statements. Designed to support automated inference, the program z encodes the central features of the statement S and reveals its compositional structure. Our statement decomposition is based on the structure of the program. Specifically, we first extract the program skeleton $z_s$ by omitting the arguments in the selected program z; we then group the (T, S, z) triples by $z_s$ to identify four major decomposition types: conjunction, comparative, superlative, and uniqueness. We design simple templates for each decomposition type, which contain instructions on how to decompose the statement; this manual process only takes a few hours. In this way, we can construct pseudo decompositions, including sub-statements and sub-questions, by filling the slots in the templates according to the original statements or program arguments. Templates and decomposition examples can be found in Figure 2. Each sample in our constructed pseudo dataset is denoted as an (S, c, D) triple, where c indicates one of the four types and D is a sequence of pseudo decompositions.
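A short sketch of the skeleton-extraction-and-grouping step, assuming programs are available as linearized strings; the format and helper names below are illustrative.

```python
import re
from collections import defaultdict

def program_skeleton(program: str) -> str:
    """Keep operator names and parentheses; drop all arguments.

    e.g., "eq(max(all_rows, attendance), firhill)" -> "eq(max())"
    """
    return "".join(re.findall(r"\w+\(|\)", program.replace(" ", "")))

# Toy (T, S, z) triples; tables are referenced by id here for brevity.
triples = [
    ("t1", "the highest attendance was at firhill",
     "eq(max(all_rows, attendance), firhill)"),
]

# Group triples by skeleton to identify the major decomposition types.
groups = defaultdict(list)
for table, statement, program in triples:
    groups[program_skeleton(program)].append((table, statement, program))
```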
Data Augmentation. With the (T, S, z) triples, we perform data augmentation. Since some entity mentions in S and z can be linked to cells in T, we can randomly replace the linked entities in S and z with different values from the same column of T. For example, in Figure 1, we can replace the linked entity "firhill" with another randomly selected entity "cappielow". Another augmentation strategy inverts superlatives and comparatives: for the examples belonging to the superlative and comparative types, we replace the original superlative or comparative in the statement with its antonym, such as higher → lower and longest → shortest. In this way, we generate another 3k pseudo statement-decomposition pairs. In total, the final decomposition dataset used for generation model training includes 9,696 samples. More statistics are available in Appendix A.2.
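The two augmentation strategies can be sketched as follows, assuming the table is available as a column-to-values mapping and entity linking has already been resolved; all names and the antonym list are illustrative.

```python
import random

ANTONYMS = {"higher": "lower", "longest": "shortest",
            "highest": "lowest", "more": "fewer"}  # illustrative subset

def replace_linked_entity(statement, program, table, column, entity):
    """Swap a linked entity with another value from the same table column."""
    alternatives = [v for v in table[column] if v != entity]
    new_entity = random.choice(alternatives)
    return (statement.replace(entity, new_entity),
            program.replace(entity, new_entity))

def invert_superlative_or_comparative(statement):
    """Replace a superlative/comparative word with its antonym, if present."""
    for word, antonym in ANTONYMS.items():
        if word in statement.split():
            return statement.replace(word, antonym)
    return None

table = {"venue": ["firhill", "cappielow"], "attendance": ["2000", "1500"]}
print(replace_linked_entity("the highest attendance was at firhill",
                            "eq(max(all_rows, attendance), firhill)",
                            table, "venue", "firhill"))
print(invert_superlative_or_comparative("the highest attendance was at firhill"))
```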

Learning to Decompose
Decomposition Type Detection. Given a statement S, we train a five-way classifier based on BERT to identify whether the statement is decomposable and, if so, which decomposition type it belongs to. In addition to the four types mentioned in the previous section, we add an atomic category by including additional non-decomposable samples. Only statements not assigned the atomic label are passed on for decomposition.
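A sketch of such a five-way classifier with the Hugging Face transformers library; the bert-base-uncased checkpoint and the label order are assumptions, and the snippet shows inference with an untrained classification head (in practice the model would first be fine-tuned on the pseudo data).

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

LABELS = ["conjunction", "comparative", "superlative", "uniqueness", "atomic"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

inputs = tokenizer("the highest attendance was at firhill", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[logits.argmax(dim=-1).item()])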
Decomposition Model. We fine-tune GPT-2 (Radford et al., 2019) on the pseudo dataset for decomposition generation. Specifically, given an (S, c, D) triple, we train the model by maximizing the likelihood $\mathcal{J} = \log p_\theta(D \mid S, c)$. We provide the model with the gold decomposition type c during training and the predicted type ĉ during testing. Only informative and well-formed decompositions are involved in the subsequent process to enhance the downstream verification. In case some sub-statements need further decomposition, they can be resent through our pipeline.
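One training step for this setup might look as follows; the flat serialization of (S, c, D) with plain-text separators is an assumption, since the paper does not specify its exact input format.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Serialize (S, c, D) into a single sequence; separators are illustrative.
text = ("statement: the highest attendance was at firhill "
        "type: superlative "
        "decomposition: which venue had the highest attendance ? "
        "the highest attendance was at [ANS] .")

inputs = tokenizer(text, return_tensors="pt")
# Standard LM objective; maximizing log p(D | S, c) more precisely would
# mask the loss on the prefix tokens (not shown here).
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
```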

Solving Subproblems
We adapt TAPAS (Eisenschlos et al., 2020), a SOTA model on table-based fact verification and QA tasks, to solve the decomposed subproblems. Verifying sub-statements is formulated as binary classification with the TAPAS model fine-tuned on the TABFACT (Chen et al., 2020) dataset. To answer each sub-question, we use TAPAS fine-tuned on the WikiTableQuestions (Pasupat and Liang, 2015) dataset. We combine the subproblems and their answers in a pairwise manner to obtain the intermediate evidence E; an example is shown in Figure 1.
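A sketch of how the two TAPAS models could be invoked with the Hugging Face transformers library; the checkpoint names are assumptions (public TAPAS checkpoints fine-tuned on WikiTableQuestions and TABFACT exist under similar names).

```python
import pandas as pd
import torch
from transformers import (TapasForSequenceClassification, TapasTokenizer,
                          pipeline)

# All cell values must be strings for the TAPAS tokenizer.
table = pd.DataFrame({"venue": ["firhill", "cappielow"],
                      "attendance": ["2000", "1500"]})

# Sub-questions: TAPAS fine-tuned on WikiTableQuestions.
qa = pipeline("table-question-answering",
              model="google/tapas-base-finetuned-wtq")
print(qa(table=table, query="which venue had the highest attendance ?"))

# Sub-statements: TAPAS fine-tuned on TABFACT (binary entailment).
name = "google/tapas-base-finetuned-tabfact"
tok = TapasTokenizer.from_pretrained(name)
clf = TapasForSequenceClassification.from_pretrained(name)
inputs = tok(table=table,
             queries=["the highest attendance was at firhill"],
             padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = clf(**inputs).logits  # index 1 ~ entailed, 0 ~ refuted
print(logits.softmax(dim=-1))
```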

Recombining Intermediate Evidence
Downstream tasks can utilize the intermediate evidence in various ways. In this paper, we train a model to fuse the evidence E together with the statement S and table T for table-based fact verification. Specifically, we jointly encode S and T with TAPAS to obtain the joint representation $h_{ST}$. We encode the multiple evidence sentences with another TAPAS, following the document-level encoder proposed by Liu and Lapata (2019): we insert a [CLS] token at the beginning of every sentence $e_i$ and take the corresponding [CLS] embedding $h_{e_i}$ in the final layer to represent $e_i$.
We employ a gated attention model to obtain the aggregated evidence representation $h_{evd}$ and predict the final label as follows:
$$g_i = \sigma\big(W_g [h_{ST} \oplus h_{e_i}]\big), \qquad h_{evd} = \sum_i g_i \odot h_{e_i}, \qquad \hat{y} = \operatorname{softmax}\big(W_o [h_{ST} \oplus h_{evd}]\big),$$
where the $W$ are trainable parameters, $\sigma$ is the sigmoid function, and $\oplus$ indicates concatenation.
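A minimal PyTorch sketch of this gated fusion, with assumed layer shapes and a single gating layer (the paper's exact parameterization may differ).

```python
import torch
import torch.nn as nn

class GatedEvidenceFusion(nn.Module):
    """Sketch of the gated attention fusion; layer shapes are assumptions."""

    def __init__(self, hidden: int = 768, num_labels: int = 2):
        super().__init__()
        self.gate = nn.Linear(2 * hidden, hidden)
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, h_st, h_evidence):
        # h_st: (batch, hidden); h_evidence: (batch, num_evidence, hidden)
        expanded = h_st.unsqueeze(1).expand_as(h_evidence)
        g = torch.sigmoid(self.gate(torch.cat([expanded, h_evidence], dim=-1)))
        h_evd = (g * h_evidence).sum(dim=1)  # aggregated evidence representation
        return self.classifier(torch.cat([h_st, h_evd], dim=-1))

fusion = GatedEvidenceFusion()
logits = fusion(torch.randn(4, 768), torch.randn(4, 6, 768))  # toy shapes
```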

Experiments
Setup. We conduct our experiments on TABFACT (Chen et al., 2020), a large-scale benchmark for table-based fact verification. We also conduct significance tests over both the base and large models (the proposed model vs. TAPAS) with a one-tailed t-test; the p-value is 4.7e-6 for the base model and 3.2e-7 for the large model.

Evaluation of Decompositions. We use both an automated metric and human validation to evaluate the decomposition quality. For the automated metric, we randomly sample 1,000 training cases from the pseudo decomposition dataset as a hold-out validation set, on which we use BLEU-4 (Papineni et al., 2002) to measure the generation quality. We also sample 100 decomposable cases from the TABFACT test set and ask three crowd workers to judge whether the model produces plausible decompositions. The ablation results in Table 3 indicate that data augmentation and the use of type information improve the decomposition quality, and that the BLEU-4 score on the pseudo decomposition dataset reflects the human judgements well.
Since we remove defective decompositions to reduce noise in the verification task, the number of decomposed cases used by our final verification model varies with the decomposition quality. We provide the percentages of valid decompositions on all data splits of TABFACT in Table 4. The results show that our decompositions do not completely align with the simple/complex split provided in TABFACT, and that data augmentation improves the number of valid decompositions by around 7%. On the downstream verification task, a lower-quality decomposition model (39.4% valid decompositions) yields a 0.4% accuracy drop compared to our proposed decomposition model (46.7%).

Related Work
Existing work on fact verification is mainly based on evidence from unstructured text (Thorne et al., 2018; Hanselowski et al., 2018; Yoneda et al., 2018; Thorne et al., 2019; Nie et al., 2019). Our work focuses on fact verification based on structured tables (Chen et al., 2020). Unlike previous work (Zhong et al., 2020; Shi et al., 2020; Eisenschlos et al., 2020), we propose a framework to verify statements via decomposition.
Sentence decomposition takes the form of Split-and-Rephrase, proposed by Narayan et al. (2017), which splits a complex sentence into a sequence of shorter sentences while preserving the original meaning (Aharoni and Goldberg, 2018; Botha et al., 2018; Guo et al., 2020). In QA, question decomposition has been applied to help answer multi-hop questions (Iyyer et al., 2016; Talmor and Berant, 2018; Min et al., 2019; Wolfson et al., 2020; Perez et al., 2020). Our work mainly focuses on decomposing statements for table-based fact verification with pseudo supervision from programs.

Conclusion
In this paper, we propose a framework to better verify complex statements via decomposition. Without annotating gold decompositions, we propose a program-guided approach to creating pseudo decompositions, on which we fine-tune GPT-2 for decomposition generation. By solving the decomposed subproblems, we can integrate useful intermediate evidence for the final verification and improve the state-of-the-art accuracy to 82.7% on TABFACT.

A.1 Program Selection
We fine-tune BERT (Devlin et al., 2019) to model $p_\theta(z \mid S)$, the probability of program z being semantically consistent with S. Since gold programs are not available, we use the final verification labels as weak supervision. To mitigate the impact of spurious programs, i.e., programs that execute to the correct answers with incorrect operation combinations, we optimize the model with a margin loss:
$$\mathcal{L} = \max\big(0,\; \gamma - p_\theta(z^+ \mid S) + p_\theta(z^- \mid S)\big),$$
where $z^-$ and $z^+$ denote the label-inconsistent and label-consistent programs with the highest probability, respectively, and $\gamma$ is the parameter controlling the margin. The margin loss encourages selecting the program that is most semantically relevant to the statement while maintaining a margin between the positive (label-consistent) and negative (label-inconsistent) programs.
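A minimal PyTorch sketch of this margin loss; the value of gamma and the batching are assumptions.

```python
import torch

def margin_loss(p_pos: torch.Tensor, p_neg: torch.Tensor,
                gamma: float = 0.1) -> torch.Tensor:
    """Hinge loss between the highest-probability label-consistent program
    (p_pos) and label-inconsistent program (p_neg); gamma is an assumed value."""
    return torch.clamp(gamma - p_pos + p_neg, min=0.0).mean()

# Toy usage: probabilities p_theta(z | S) for a batch of statements.
loss = margin_loss(torch.tensor([0.8, 0.6]), torch.tensor([0.3, 0.5]))
```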

A.2 Statistics of Pseudo Dataset
We have 9,696 pseudo statement-decomposition pairs in total; the number of samples belonging to each of the four decomposition types is given in Table 5.
To train the decomposition type detection model, we add an additional atomic category with 1,739 statements.

A.3 Statistics of TABFACT Dataset
The statistics of TABFACT (Chen et al., 2020), the large-scale table-based fact verification benchmark on which we evaluate our method, can be found in Table 6. The test set is further split into a simple set and a complex set, which include 4,171 and 8,608 sentences, respectively. A small test set with 1,998 samples is provided for human performance evaluation.

A.4 Compared Systems
• LPA (Chen et al., 2020) derives a program for each statement by ranking the synthesized program candidates and takes the program execution results as predictions.
• Table-BERT (Chen et al., 2020) takes a linearized table and a statement as the input to BERT for fact verification.
• LogicalFactChecker (Zhong et al., 2020) utilizes the structures of programs to prune irrelevant information in tables and modularize symbolic operations with module networks.
• HeterTFV (Shi et al., 2020) is a graph-based reasoning approach to combining linguistic information and symbolic information.
• SAT is a structure-aware Transformer that encodes structured tables by injecting structural information into the mask of the self-attention layer.
• ProgVGAT leverages the symbolic operation information to enhance verification with a verbalization technique and a graph-based network.
• TAPAS (Herzig et al., 2020; Eisenschlos et al., 2020) is the previous SOTA model on TABFACT, which extends BERT's architecture to encode tables and is jointly pre-trained on text and tables.