Probabilistic Graph Reasoning for Natural Proof Generation

In this paper, we investigate the problem of reasoning over natural language statements. Prior neural-based approaches do not explicitly consider the inter-dependency between answers and their proofs. We propose PRobr, a novel approach for joint answer prediction and proof generation. PRobr defines a joint probability distribution over all possible proof graphs and answers via an induced graphical model. We then optimize the model using variational approximation on top of neural textual representations. Experiments on multiple datasets under diverse settings (fully supervised, few-shot, and zero-shot evaluation) verify the effectiveness of PRobr, e.g., achieving 10%-30% improvement in QA accuracy under few/zero-shot evaluation. Our code and models can be found at https://github.com/changzhisun/PRobr/.


Introduction
Automatic reasoning over explicitly provided knowledge has been a persistent goal of AI (Newell and Simon, 1956; McCarthy et al., 1960). Early approaches focused on reasoning over formal (logical or probabilistic) representations. However, automatically constructing and reasoning over formal representations remains challenging. To bypass these challenges, in this work we investigate reasoning over natural language statements instead of formal representations.
Given a set of facts and rules and a query (all expressed in natural language), we aim to predict the answer to the query and provide a proof that proves or disproves it. For example, Figure 1 contains two facts, six rules, and two queries, each expressed in natural language:

Facts:
F1: The circuit includes the battery.
F2: The wire is metal.

Rules:
R1: If the circuit includes the battery and the battery is not flat then the circuit is powered.
R2: If the circuit includes the switch and the switch is on then the circuit is complete.
R3: If the circuit does not have the switch then the circuit is complete.
R4: If the wire is metal then the wire is conducting.
R5: If the wire is plastic then the wire is not conducting.
R6: If the circuit is powered and the circuit is complete and the wire is conducting then the current runs through the circuit.

To predict whether each query is true or false, we start from the facts and reason deductively by applying the given rules until we can derive the truth value of the query. This process of deduction can be represented as a graph, where each node is either a fact, a rule, or a special NAF node (explained in Section 2.1). Generating the answer and the proof together makes a system easier to interpret and diagnose.

PROVER (Saha et al., 2020) first explored this problem through two modules, question answering and proof generation. It trains the two modules with implicit parameter sharing, and then uses integer linear programming (ILP) to enforce consistency constraints at test time only. Because the proof is not explicitly involved in answer prediction, it is difficult to ensure that the proof generation module actually contributes to question answering. Parameter sharing also becomes more limited under few/zero-shot settings, as demonstrated in our experiments. We expect the proof to enhance question answering, especially under few/zero-shot settings; one promising solution is to explicitly exploit more interaction between question answering and proof generation.
In this paper, we propose PROBR, a novel probabilistic graph reasoning framework for joint question answering and proof generation. PROBR defines a joint distribution over all possible proof graphs and answers with an undirected probabilistic graphical model (PGM), which directly characterizes the interaction between proofs and answers. PGMs generally incur intractable learning and inference for complex graphs (Koller and Friedman, 2009); for example, computing the normalization constant with traditional probabilistic propagation algorithms (e.g., the sum-product algorithm (Kschischang et al., 2001)) has prohibitive time complexity. We therefore propose a variational approach that maximizes the pseudolikelihood of the joint distribution to optimize the model more efficiently. First, we introduce a variational distribution based on a mean-field assumption. Then we maximize the pseudolikelihood of the joint distribution given the output of the variational distribution, while aligning the two distributions on the training data. PROBR can be efficiently trained by stochastic gradient descent. Our contributions are summarized as follows:
• We propose PROBR for joint question answering and proof generation, which defines a joint distribution over all possible proofs and answers with an undirected PGM to capture more dependencies.
• We present an efficient variational approximation method to learn PROBR.
• Experiments on several datasets verify the effectiveness of PROBR under multiple settings (supervised, few-shot, and zero-shot evaluation).

Task Definition
To reason over natural language statements, we answer the query and generate the corresponding proof jointly. Figure 1 shows an example. Given a declarative query Q and relevant facts and rules (expressed in natural language), the task is to predict the answer A (true/false) to the query Q under the closed-world assumption (described in Section 2.1), and meanwhile generate a proof P (described in Section 2.2) that proves or disproves Q.

Semantics
We adopt the semantics of Datalog (Ceri et al., 1989) in this work. Following prior work (Clark et al., 2020), we make a closed-world assumption (CWA): a fact is true if it can be deduced from the given context, and any fact not provable is assumed false. We also use negation as failure (NAF) (Clark, 1978), a rule of inference that allows one to deduce that NOT S is true if all possible proofs of a statement S fail. For example, in Figure 1, the NAF node before R3 represents "the circuit does not have the switch". Note that under this semantics, negative facts and negative rules are not allowed, as they are redundant under the CWA.
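The CWA and NAF semantics can be illustrated with a minimal forward-chaining sketch over the Figure 1 theory. The atom names and the rule encoding below are our own illustration, not the paper's implementation:

```python
# Minimal sketch of closed-world forward chaining with negation as
# failure (NAF). A negated premise "not X" holds exactly when X
# cannot be derived (CWA). Atom names are illustrative.

def forward_chain(facts, rules, max_iters=100):
    """Derive all facts reachable from the given facts and rules."""
    known = set(facts)
    for _ in range(max_iters):
        added = False
        for body, head in rules:
            # A premise "not X" succeeds iff X is not derivable.
            if all((p[4:] not in known) if p.startswith("not ") else (p in known)
                   for p in body) and head not in known:
                known.add(head)
                added = True
        if not added:
            break
    return known

facts = {"circuit_includes_battery", "wire_is_metal"}
rules = [
    (["circuit_includes_battery", "not battery_is_flat"], "circuit_is_powered"),
    (["not circuit_has_switch"], "circuit_is_complete"),
    (["wire_is_metal"], "wire_is_conducting"),
    (["circuit_is_powered", "circuit_is_complete", "wire_is_conducting"],
     "current_runs"),
]
derived = forward_chain(facts, rules)  # includes "current_runs"
```

Note that this naive iteration is only safe for stratified theories (a "not X" premise must not later become derivable), which holds for the datasets considered here since negative heads are disallowed.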

Formulations
A proof is a directed acyclic graph (Figure 1). Each node is either a fact, a rule, or a special NAF node. Each edge directs from either a fact (or NAF) to a rule, or from one rule to another rule, indicating that the fact (or the source rule's conclusion) is consumed by the target rule. For simplicity, let the context C = {s_1, . . . , s_n} denote the collection of sentences, each of which is a fact or a rule.

Proof Formulation
We assign an indicator variable (0/1) to each possible node and edge to vectorize the structure of a given proof P. Specifically, we introduce a node indicator variable V_i for each element s_i in the context C, and edge indicator variables E = {E_ij} (i ≠ j) for each possible edge from node s_i to node s_j, where:
• V_i = 1 indicates s_i is in the proof P, while V_i = 0 means s_i is absent.
• E_ij = 1 indicates there is an edge directed from s_i to s_j, while E_ij = 0 means there is no edge from s_i to s_j in the proof.
In addition, we assign a binary answer variable A to indicate the truth value of the query. Figure 2a shows a simplified example, where the context is C = {s_1, s_2, s_3} and the query can be decided as true by a very simple proof consisting of only two nodes (s_1 and s_3) and a single edge (from s_1 to s_3). This proof is represented by the variables V_1 = 1, V_2 = 0, V_3 = 1, E_13 = 1, all other E_ij = 0, and A = 1. (Figure 2: (a) the proof graph and its induced random variables; (b) the factor graph induced by the proof graph.)
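As a concrete sketch, the toy proof above (nodes s_1 and s_3, a single edge s_1 → s_3) can be vectorized into the indicator variables as follows (variable names are illustrative):

```python
# Sketch: encode a proof graph over a 3-sentence context as the
# indicator variables V_i and E_ij described above.
n = 3
proof_nodes = {1, 3}     # s1 and s3 are in the proof
proof_edges = {(1, 3)}   # a single edge s1 -> s3

V = [1 if i in proof_nodes else 0 for i in range(1, n + 1)]
E = {(i, j): 1 if (i, j) in proof_edges else 0
     for i in range(1, n + 1) for j in range(1, n + 1) if i != j}
A = 1  # the query is proved true

# V == [1, 0, 1]; E[(1, 3)] == 1 and all other entries are 0
```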

Approach
We introduce the proposed framework PROBR, which jointly answers the given query over a natural language context and generates the corresponding proof. Unlike PROVER, which makes an independence assumption, PROBR can capture more dependencies between the proof and the answer. PROBR defines a joint distribution over all possible proofs and answers with an undirected graphical model (Section 3.1), and we use neural networks to parameterize each component (Section 3.2). To optimize PROBR efficiently, we adopt a variational approach that maximizes the pseudolikelihood of the joint distribution (Section 3.3). Finally, we introduce the inference strategy (Section 3.4).

Overview
We start by formalizing the joint question-answering and proof-generation modules in a probabilistic way, with the following notation:
• A context C = {s_1, . . . , s_n}, where each s_i is a sentence.
• A query Q.
• An answer variable A, taking values a ∈ {0, 1}.
• Node variables V = {V_i}, each V_i taking values v_i ∈ {0, 1}.
• Edge variables E = {E_ij} (i ≠ j), each E_ij taking values e_ij ∈ {0, 1}.
• Let Y = (A, E, V) denote all output variables.
We use uppercase letters for variables (e.g., Y, A, V_i, E_ij) and lowercase letters for their values (e.g., y, a, v_i, e_ij).
Given a context C and a query Q, PROBR assigns true/false values to all variables: the answer variable A, the node variables V, and the edge variables E. We define a joint distribution over all possible Y, formally denoted p(Y):

p(Y) = (1/Z) Φ_A(A) ∏_i Φ_{V_i}(V_i, A) ∏_{i≠j} Φ_{E_ij}(V_i, V_j, E_ij, A),   (1)

where Z is the normalization constant and each Φ is a potential function. Unlike PROVER, which makes an independence assumption, the factorization of Equation 1 characterizes the interaction among the variables V_i, V_j, E_ij, and A. Figure 2b shows the factor graph of the joint distribution p(Y) for the example in Figure 2a. In principle, given the ground truth y*, we can minimize the objective

L_joint = −log p(Y = y*).   (2)

However, the normalization constant of p(Y) is hard to compute due to the high-order factors of large size on the right-hand side of Equation 1. In this paper, we provide a variational solution to optimize L_joint (Section 3.3).
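To make the factorization concrete, the following sketch scores a full assignment as a product of potential values and computes the normalization constant Z by brute force. The toy potential tables are made-up stand-ins for the neural potentials, and the exhaustive sum is only feasible for tiny contexts, which illustrates exactly why exact normalization is avoided:

```python
import itertools

# Sketch of the joint factorization: p(Y) proportional to
# phi_A(A) * prod_i phi_V(V_i, A) * prod_{i!=j} phi_E(V_i, V_j, E_ij, A).
n = 2
phi_A = {0: 1.0, 1: 2.0}
phi_V = {(v, a): 1.0 + 0.5 * v * a for v in (0, 1) for a in (0, 1)}
phi_E = {(vi, vj, e, a): 1.0 + e * vi * vj
         for vi in (0, 1) for vj in (0, 1) for e in (0, 1) for a in (0, 1)}

def score(a, v, e):
    """Unnormalized probability of a full assignment Y = (a, v, e)."""
    s = phi_A[a]
    for i in range(n):
        s *= phi_V[(v[i], a)]
    for i in range(n):
        for j in range(n):
            if i != j:
                s *= phi_E[(v[i], v[j], e[(i, j)], a)]
    return s

# Brute-force Z over all assignments (exponential in n; intractable
# for realistic contexts).
pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
Z = sum(score(a, v, dict(zip(pairs, es)))
        for a in (0, 1)
        for v in itertools.product((0, 1), repeat=n)
        for es in itertools.product((0, 1), repeat=len(pairs)))
```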

Parameterization
We use neural networks to parameterize each potential function of Equation 1.

Text Representation Network Given a context C and a query Q, we use RoBERTa (Liu et al., 2019) as our backbone network to obtain contextual representations, following (Clark et al., 2020; Saha et al., 2020). The input to RoBERTa is the concatenation of C and Q, separated by [SEP] tokens.

Potential Function for the Answer (Φ_A) After RoBERTa encoding, we obtain the global representation of the entire input from the first token [CLS], denoted h_[CLS]. To score the possible values of the variable A, i.e., 0 or 1, we use a multi-layer perceptron (MLP) as a nonlinear transformation: Φ_A(A) = MLP_1(h_[CLS]).

Potential Function for Statements (Φ_{V_i}) For each sentence s_i (a fact or a rule), we compute the sentence representation h_{s_i} by mean-pooling all of its token representations from the RoBERTa output. Since NAF is a special fact, we compute h_NAF through a linear transformation of h_[CLS]. To score the possible values of the variable pair (V_i, A), we use another MLP as a score function: Φ_{V_i}(V_i, A) = MLP_2(h_{s_i}), where the output dimension 4 covers the possible value combinations of V_i and A. The parameters of MLP_2 are shared across all sentences.
Potential Function for Statement Relations (Φ_{E_ij}) For each sentence pair (s_i, s_j), we obtain the pair representation h_{s_i,s_j} = h_{s_i} ⊕ h_{s_j} ⊕ (h_{s_i} − h_{s_j}), where ⊕ is vector concatenation and the element-wise difference captures directionality. To score the four variables (V_i, V_j, E_ij, A) simultaneously, we use a third MLP as a score function: Φ_{E_ij}(V_i, V_j, E_ij, A) = MLP_3(h_{s_i,s_j}), where the output dimension 16 covers the possible value combinations of the four variables. The parameters of MLP_3 are shared across all sentence pairs.
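A minimal sketch of this parameterization, with tiny random weights standing in for the trained MLPs and random vectors standing in for RoBERTa sentence representations (all sizes, initializations, and names are illustrative assumptions, not the paper's configuration):

```python
import random

# Sketch: one shared MLP maps a sentence vector h_si to 4 scores for
# (V_i, A); another maps [h_si ; h_sj ; h_si - h_sj] to 16 scores
# for (V_i, V_j, E_ij, A). Pure-Python stand-in for the real model.
random.seed(0)

def mlp(x, w1, w2):
    """Two-layer perceptron with ReLU, no biases (toy)."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    return [sum(wi * hi for wi, hi in zip(row, hidden)) for row in w2]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

d = 8  # toy embedding size (RoBERTa would give 768 or 1024)
h_si = [random.random() for _ in range(d)]
h_sj = [random.random() for _ in range(d)]

# MLP_2: sentence -> 4 scores, one per (V_i, A) combination.
w1_v, w2_v = rand_matrix(16, d), rand_matrix(4, 16)
phi_Vi = mlp(h_si, w1_v, w2_v)

# MLP_3: pair features include the element-wise difference so the
# representation is direction-sensitive.
pair = h_si + h_sj + [a - b for a, b in zip(h_si, h_sj)]
w1_e, w2_e = rand_matrix(16, 3 * d), rand_matrix(16, 16)
phi_Eij = mlp(pair, w1_e, w2_e)
```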

Learning the Model
To tackle the challenge of optimizing L_joint (Equation 2), we adopt the widely used pseudolikelihood as an alternative objective for optimization (Richardson and Domingos, 2006), bypassing the calculation of the normalization constant.
Pseudolikelihood Given the set of variables Y, the pseudolikelihood of Y is defined as

p_pseudo(Y) = ∏_{y ∈ Y} p(y | Y_{−y}),   (3)

where Y_{−y} denotes all variables in Y except y. Given the ground truth y*, we can minimize the corresponding negative log-pseudolikelihood. However, it is difficult to decode the optimal assignments based on the pseudolikelihood (Equation 3). There is a rich body of literature on decoding via sampling (Chapter 12 of (Salakhutdinov, 2014)); in this paper, however, we choose a modern approach using variational approximation.
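The key computational property of the pseudolikelihood is that each conditional p(y | Y_{−y}) requires only unnormalized scores, since the global normalization constant cancels. A toy sketch for binary variables, where the score function is a made-up stand-in for the product of potentials:

```python
import math

# Sketch: the conditional p(y | Y_-y) compares the unnormalized joint
# score of the current assignment with the score where only y is
# flipped; the constant Z cancels in the ratio.

def score(assignment):
    # Toy unnormalized score over (A, V1, V2): rewards agreement.
    a, v1, v2 = assignment
    return math.exp(a * v1 + v1 * v2)

def conditional(assignment, index):
    """p(assignment[index] | all other variables), Z-free."""
    flipped = list(assignment)
    flipped[index] = 1 - flipped[index]
    s_cur, s_flip = score(assignment), score(tuple(flipped))
    return s_cur / (s_cur + s_flip)

y = (1, 1, 0)
p_a = conditional(y, 0)  # p(A=1 | V1=1, V2=0) = e / (e + 1)
```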
Variational Approximation We approximate the pseudolikelihood of Y with a mean-field (Opper and Saad, 2001) variational distribution q(Y), in which the variables y ∈ Y are mutually independent:

q(Y) = q(A) ∏_i q(V_i) ∏_{i≠j} q(E_ij),

where each factor is parameterized by a neural network. Once the variational distribution q(Y) is obtained, it provides the conditioning values for each pseudolikelihood term p(y | Y_{−y}), thus avoiding sampling to obtain the optimal assignments. During optimization, we adopt a simple strategy to update the parameters of p and q:
• For the node and edge variables, we optimize the cross-entropy of q against the gold proof: L_proof = −∑_i log q(V_i = v_i*) − ∑_{i≠j} log q(E_ij = e_ij*).
• For the answer variable, we optimize the conditional of the joint model given q's proof predictions: L_qa = −log p(A = a* | Ê, V̂), where (Ê, V̂) are the assignments predicted by q.
The final objective is to minimize L_final = L_proof + L_qa. Overall, PROBR is a mixture of an independent (variational) model and an undirected graphical model through some reasonable approximations. Our final optimized distribution can be decomposed as q(E, V) p(A | E, V), where q(E, V) adopts the independent factorized probability and p(A | E, V) is implied by the undirected graphical model (Equation 1). In this way, PROBR enjoys the advantage of global normalization (from the undirected graphical model) while remaining easy to optimize (like a directed graphical model).
Discussion Another way to achieve consensus between q(Y) and p_pseudo(Y) is to directly optimize the KL divergence L_kl = KL(q(Y) || p_pseudo(Y)). However, L_kl does not bring any improvement in supervised learning (Section 4.6), so we exclude it during training. PROBR can be easily extended to the semi-supervised scenario via this L_kl term: minimize L_final on the labeled data and L_kl on the unlabeled data. We leave this for future work.
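The training strategy above can be sketched as two loss terms: a cross-entropy term fitting the mean-field marginals to the gold proof, and an answer term conditioning the joint model on q's hard proof predictions. All probability values below are made-up toy numbers, not model outputs:

```python
import math

# Sketch of the two-part objective: L_final = L_proof + L_qa.
q_v = [0.9, 0.2, 0.8]   # q(V_i = 1) for three sentences (toy values)
q_e13 = 0.7             # q(E_13 = 1)
gold_v, gold_e13, gold_a = [1, 0, 1], 1, 1

def nll(p, y):
    """Negative log-likelihood of a Bernoulli with success prob p."""
    return -math.log(p if y == 1 else 1.0 - p)

# Proof loss: independent cross-entropies under the mean-field q.
loss_proof = sum(nll(p, y) for p, y in zip(q_v, gold_v)) + nll(q_e13, gold_e13)

# Answer loss: condition p(A | E, V) on q's argmax proof prediction.
v_hat = [int(p > 0.5) for p in q_v]
e13_hat = int(q_e13 > 0.5)
p_a_given_proof = 0.95  # toy stand-in for the conditional of Eq. (1)
loss_qa = nll(p_a_given_proof, gold_a)

loss_final = loss_proof + loss_qa
```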

Inference
After training, for nodes and edges we take the predictions of the variational model, and for the answer we take the prediction of the joint model conditioned on the variational model's output. In addition, we employ Integer Linear Programming (ILP) to enforce consistency constraints, following (Saha et al., 2020).
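The exact constraint set follows PROVER (Saha et al., 2020); as a sketch, two representative structural constraints the ILP step enforces (edges may only connect selected nodes, and the proof must be acyclic) can be checked as follows (this check is illustrative, not the ILP formulation itself):

```python
# Sketch: verify structural consistency of a predicted proof. Nodes
# are 0-indexed here; V is the node indicator list, E maps (i, j)
# pairs to edge indicators.

def is_consistent(V, E, n):
    # Constraint 1: edges must connect nodes that are both selected.
    for (i, j), e in E.items():
        if e == 1 and not (V[i] == 1 and V[j] == 1):
            return False
    # Constraint 2: acyclicity, via depth-first search on active edges.
    adj = {i: [j for (a, j), e in E.items() if a == i and e == 1] for i in range(n)}
    state = {}  # 0 = visiting, 1 = done

    def dfs(u):
        state[u] = 0
        for w in adj[u]:
            if state.get(w) == 0 or (w not in state and not dfs(w)):
                return False  # back edge found: cycle
        state[u] = 1
        return True

    return all(dfs(u) for u in range(n) if u not in state)

V = [1, 0, 1]
E = {(i, j): 0 for i in range(3) for j in range(3) if i != j}
E[(0, 2)] = 1
ok = is_consistent(V, E, 3)   # valid: edge between two selected nodes
E[(0, 1)] = 1
bad = is_consistent(V, E, 3)  # invalid: node 1 is not in the proof
```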

Experiments
To evaluate the effectiveness and generality of PROBR, we conduct fully supervised, few-shot, and zero-shot learning experiments over several datasets, against two baselines: RuleTakers and PROVER.
DU0-DU5 DUd (d = 0, 1, 2, 3, 5) are five synthetic datasets, each containing 100k queries with theories expressed in templated English, annotated proof graphs, and True/False answers. Answering queries in DUd requires reasoning up to depth d.

Birds-Electricity
This is a test-only dataset of 5k samples in total, describing birds and electric circuits, used to evaluate the out-of-distribution performance of the models.
ParaRules ParaRules is a dataset generated and paraphrased from sampled theories (facts + rules). It contains 40k queries against ≈2k theories, where the original templated-English facts and rules are creatively paraphrased into more diverse natural language via crowdsourcing. For example, the fact "Dave is cold" can be rephrased as "After Dave got wet in the rain, he feels cold"; the rule "If someone is nice then they are young" can be rephrased as "A person described as being nice will certainly be young". Unlike the DUd and Birds-Electricity datasets, which are composed of synthetic language, ParaRules better tests models' reasoning ability over human-like language.
Metrics We evaluate performance on both answers and proofs. For answers, we report QA accuracy (QA). For proofs, we report proof accuracy (PA), the fraction of examples where the generated proof exactly matches the gold proof. We also report full accuracy (FA), the fraction of examples where both the answer and the proof are exactly correct.
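A sketch of computing the three metrics from paired (answer, proof) predictions, where proofs are compared as exact node/edge sets (the example values are made up for illustration):

```python
# Sketch: QA = answer correct, PA = proof exactly matches gold,
# FA = both correct, each averaged over the evaluation set.

def metrics(examples):
    qa = pa = fa = 0
    for gold_a, pred_a, gold_proof, pred_proof in examples:
        a_ok = gold_a == pred_a
        p_ok = gold_proof == pred_proof  # exact match of (nodes, edges)
        qa += a_ok
        pa += p_ok
        fa += a_ok and p_ok
    n = len(examples)
    return qa / n, pa / n, fa / n

examples = [
    (1, 1, ({"F1"}, set()), ({"F1"}, set())),       # both correct
    (0, 0, ({"F2"}, set()), ({"F1"}, set())),       # answer only
    (1, 0, ({"F1", "R4"}, {("F1", "R4")}),
            ({"F1", "R4"}, {("F1", "R4")})),        # proof only
]
qa, pa, fa = metrics(examples)  # (2/3, 2/3, 1/3)
```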

Fully Supervised Learning
For the supervised setting, we train PROBR on the training split of the DU5 dataset with gold answers and gold proofs and evaluate on the DU5 test split. We evaluate the above metrics at varying depths d against two state-of-the-art baselines, RuleTakers (Clark et al., 2020) and PROVER (Saha et al., 2020), as shown in Table 1. For RuleTakers and PROVER, we adopt the results reported in their papers where available; for settings beyond those papers, we reproduce the baselines using the provided code and parameters: https://github.com/swarnaHub/PRover. Note that RuleTakers cannot generate a proof, so we only report PA and FA for PROVER and PROBR. The corresponding validation set results can be found in the supplementary materials.
Overall, at each depth, PROBR achieves QA accuracy comparable or superior to the baselines, and for 88.8% of test examples PROBR generates exactly correct proofs and answers. Similar to PROVER, the full accuracy matches the proof accuracy for PROBR, showing that in this fully supervised setting, full accuracy is determined by proof accuracy at each depth: the predicted answer is always correct when the corresponding proof is correct. In fact, answer prediction is much easier than proof generation.
As depth increases, PROBR continues to provide accurate answers without any loss in QA performance. Generating correct proofs becomes harder for both PROVER and PROBR, but PROBR outperforms PROVER by 7 points of proof accuracy (65.1% → 72.2%) at depth 5.

Few-shot Learning
We explore the few-shot learning ability of PROBR against PROVER by reducing the training data size. For comparison, we follow the same setting as (Saha et al., 2020): we randomly sample 30k, 10k, or 1k queries from the 69,762 training queries to train the model, denoted "RQ".
It is worth noting that in the DU5 training set, several queries can be asked about a shared context. To better explore model ability as the amount of training data varies, we conduct another set of experiments, denoted "RC": we first randomly select a varying percentage (10%, 5%, 1%) of the contexts appearing in the DU5 training set, and then keep the training samples whose queries are asked about the selected contexts. Results for both "RQ" and "RC" are shown in Table 2. Generally speaking, proof generation improves slowly with more training data, while QA performance improves rapidly as the training size grows. PROBR consistently beats PROVER on QA accuracy in every setting in Table 2. Remarkably, PROBR achieves 88.2% QA accuracy when trained with only 700 samples (RC-1%). Overall, PROBR's question answering is more stable as training data varies, whereas PROVER's QA accuracy drops sharply when training data is scarce. This is because PROBR considers the joint distribution over all possible proofs and answers, and can better learn to reason over natural language statements. As for proof accuracy, although PROBR falls behind PROVER in some settings (RC-1%), we will see shortly that PROVER overfits to the small training data (Sections 4.4 and 4.5).
Another interesting observation is that full accuracy is not always consistent with proof accuracy in few-shot learning, unlike the observation in Section 4.2. Furthermore, the gap between PA and FA is much smaller for PROBR than for PROVER. This is because PROVER trains in a multi-task fashion, where the question answering and proof generation modules can make independent errors, especially when training data is scarce, whereas PROBR makes better use of limited data for reasoning, again verifying its effectiveness.

Zero-shot Evaluation
Following previous work (Clark et al., 2020; Saha et al., 2020), we evaluate the out-of-distribution (OOD) performance of PROBR against the baselines on six sub-datasets of Birds-Electricity; results are reported in Table 3. On QA accuracy, PROBR clearly outperforms PROVER and RuleTakers on all sub-datasets. On proof accuracy, PROBR performs better when the depth of the out-of-domain sample is ≤ 3, while there is a PA drop compared to PROVER when testing on E4. This is an interesting phenomenon: proof accuracy drops for complicated unseen queries, yet QA accuracy on out-of-domain queries improves substantially (11 points on E4: 84.8% → 95.6%). We leave it to future work to explore the portability of proofs and how an out-of-domain proof can help question answering.
Moreover, we evaluate zero-shot performance after few-shot learning. Table 4 reports results on Birds-Electricity after training only on partial DU5 (RC-k and RQ-k, described in Section 4.3) training partitions. PROBR is well ahead of PROVER on QA accuracy, but appears worse than PROVER on proof accuracy in this zero-shot test. We again highlight this observation: data from different domains may have different proof forms, and well-learned proofs from one domain may not transfer directly to another; nevertheless, by training with PROBR, the well-learned proofs from one domain can help answer out-of-distribution queries.

Generalization Ability
Generalize to Unseen Depth We conduct experiments to explore how well PROBR generates proofs and answers at depths unseen during training. Following PROVER, we train the model on the training splits of DU0, DU1, DU2, and DU3, respectively, and test QA and proof performance on the full DU5 test set. Since DU5 contains queries of higher depth than those seen during training, this evaluates the model's ability to generalize to higher depths. As shown in Table 5, PROBR outperforms RuleTakers and PROVER on QA, PA, and FA when trained on DU1, DU2, or DU3, with especially significant improvement on QA. PROBR achieves high QA performance even when trained only on depth ≤ 1 data (97.7%), demonstrating superior generalization over depth. This means PROBR can accurately answer complicated queries using only simple training samples, which reduces the cost of constructing training data.

Generalize to Complex Language
We also evaluate the robustness of PROBR when generalizing to more diverse natural language. Following (Clark et al., 2020; Saha et al., 2020), we train our model on the combined training partitions of DU3 and ParaRules, and then test on the ParaRules test partition. The results in Table 6 show that PROBR is more robust to human-like language.
To further test generalization to complex natural language, we train the model only on DU5 or on partial DU5 (RC-k and RQ-k, described in Section 4.3) training partitions and test on the ParaRules test split. This is a more convincing setup, since the model sees only templated language, and never human-like language, during training. Results are shown in Table 7. When testing on ParaRules after training only on DU5, PROBR outperforms PROVER by nearly 30 points on QA accuracy (53.6% → 82.8%). A similar trend holds for training on the RC-k and RQ-k datasets, where PROBR improves QA accuracy when generalizing to human-like natural language. The difference in proof accuracy between PROBR and PROVER is not significant, which supports the observations in Sections 4.3 and 4.4: joint question-answering and proof-generation learning in PROBR improves QA performance, but does not necessarily improve proof performance.

Ablation Studies
We investigate the effect of the training strategy and the objective term L_kl. Specifically, we compare PROBR with three variants: 1) PROBR + Gold, which replaces predicted proofs with gold proofs when optimizing L_qa during training; 2) PROBR + KL, which adds L_kl between q(Y) and p_pseudo(Y) during training; 3) PROBR + Gold + KL, which applies both. We first train PROBR and the three variants on the DU5 or partial DU5 (RC-k) training splits, and Figure 3 reports QA accuracy on the DU5 test split (left), the ParaRules test partition (middle), and Birds-Electricity (right). PROBR always achieves the best QA accuracy on all three test datasets (DU5, ParaRules, Birds-Electricity) after training on all four training sets of varying size (RC-1%, RC-5%, RC-10%, RC-100%), while the three variants show inconsistent performance across settings.

Related Work
Text Reasoning over Formal Representation Early work employs a pipeline that first converts free text into logical form (semantic parsing) and then applies formal logical reasoning (Musen and Van der Lei, 1988). Due to the serious error propagation caused by semantic parsing (Zettlemoyer and Collins, 2005; Berant et al., 2013; Berant and Liang, 2014), researchers have focused on developing theorem provers that combine symbolic techniques with differentiable learning in neural networks (Reed and de Freitas, 2016; Abdelaziz et al., 2020; Abboud et al., 2020), such as NLProlog (Weber et al., 2019), SAT solving (Selsam et al., 2019), and Neural Programmer (Neelakantan et al., 2016). To bypass this expensive and error-prone intermediate logical representation, reasoning over natural language statements in an end-to-end manner is promising.
Text Reasoning over Natural Language Natural logic (MacCartney and Manning, 2009) focuses on semantic containment and monotonicity by incorporating semantic exclusion and implicativity. Subsequently, Clark et al. (2020) proposed using a Transformer-based model to emulate deductive reasoning, achieving high accuracy on synthetically generated data. PROVER (Saha et al., 2020) points out that a reasoning system should not only answer queries but also generate proofs; however, it adopts a multi-task learning framework during training and cannot effectively capture the interactions between question answering and proof generation. Along this line, we explore more powerful joint models to achieve deep reasoning.

QA and NLI Several QA datasets involve rule reasoning, including bAbI (Weston et al., 2016), QuaRTz, ROPES, and HotpotQA (Yang et al., 2018). However, in those datasets the implicit rules (i.e., which multi-hop chains are valid) must be inferred from the training data, whereas in our task the reasoning rules are given in advance. Compared with Natural Language Inference (MacCartney and Manning, 2014), our task can be regarded as its deductive subset; in particular, NLI also allows unsupported inferences (Dagan et al., 2013).

Conclusion
In this work, we propose PROBR, a novel probabilistic graph reasoning framework for joint question answering and proof generation. PROBR defines a joint distribution over all possible answers and proofs, which directly characterizes the interaction between answers and proofs. Experiments verify the effectiveness of the proposed PROBR.


C Results of Ablation Studies

1. In all ablation experiments, PROBR achieved the best QA performance, demonstrating that it captures information critical for question answering across a variety of settings. However, since some of the datasets are synthetically generated, it is difficult to guarantee that PROBR works equally well on real datasets. We leave this as future work.

2. In some cases, the PROBR + Gold + KL variant outperforms PROBR in PA and FA, showing the potential advantage of the KL term. In the future, we will explore proof generation in a semi-supervised learning scenario through this KL term.
3. Comparing PROBR and PROBR + Gold shows that whether the predicted proof or the gold proof is used during training significantly affects final performance. Heuristic strategies such as scheduled sampling (Bengio et al., 2015) may give better results; we will try this in the future.