Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction

Solving math word problems requires deductive reasoning over the quantities in the text. Recent research efforts have mostly relied on sequence-to-sequence or sequence-to-tree models to generate mathematical expressions without explicitly performing relational reasoning between quantities in the given context. While empirically effective, such approaches typically do not provide explanations for the generated expressions. In this work, we view the task as a complex relation extraction problem, proposing a novel approach that presents explainable deductive reasoning steps to iteratively construct target expressions, where each step involves a primitive operation over two quantities defining their relation. Through extensive experiments on four benchmark datasets, we show that the proposed model significantly outperforms existing strong baselines. We further demonstrate that the deductive procedure not only presents more explainable steps but also enables us to make more accurate predictions on questions that require more complex reasoning.


Introduction
Math word problem (MWP) solving (Bobrow, 1964) is the task of answering a mathematical question that is described in natural language. Solving MWPs requires logical reasoning over the quantities presented in the context (Mukherjee and Garain, 2008) to compute the numerical answer. Various recent research efforts regarded the problem as a generation problem: typically, such models focus on generating the complete target mathematical expression, often represented as a linear sequence or a tree structure (Xie and Sun, 2019). (Our code and data are released at https://github.com/allanj/Deductive-MWP.)

[Figure 1 question: In a division sum, the remainder is 8 and the divisor is 6 times the quotient and is obtained by adding 3 to the thrice of the remainder. What is the dividend?]

Figure 1 (top) depicts a typical approach that attempts to generate the target expression in the form of a tree structure, which is adopted in recent research efforts (Xie and Sun, 2019; Patel et al., 2021; Wu et al., 2021). Specifically, the output is an expression that can be obtained from such a generated structure. We note, however, that there are several limitations with such a structure generation approach. First, such a process typically involves a particular order when generating the structure. In the example, given the complexity of the problem, the decision of generating the addition ("+") operation as the very first step could be counter-intuitive, and it does not provide adequate explanations of the reasoning process when presented to a human learner. Furthermore, the resulting tree contains identical sub-trees ("8 × 3 + 3"), as highlighted in blue dashed boxes. Unless a specifically designed mechanism is introduced for reusing the already generated intermediate expression, the approach would need to repeat the same effort when generating the same sub-expression.
Solving math problems generally requires deductive reasoning, which is also regarded as one of the important abilities in children's cognitive development (Piaget, 1952). In this work, we propose a novel approach that explicitly presents deductive reasoning steps. We make a key observation that MWP solving fundamentally can be viewed as a complex relation extraction problem -the task of identifying the complex relations among the quantities that appear in the given problem text. Each primitive arithmetic operation (such as addition, subtraction) essentially defines a different type of relation. Drawing on the success of some recent models for relation extraction in the literature (Zhong and Chen, 2021), our proposed approach involves a process that repeatedly performs relation extraction between two chosen quantities (including newly generated quantities).
As shown in Figure 1, our approach directly extracts the relation ("multiplication", or "×") between 8 and 3, which come from the contexts "remainder is 8" and "thrice of the remainder". In addition, it allows us to reuse the result of the intermediate expression in the fourth step. This process naturally yields a deductive reasoning procedure that iteratively derives new results from existing ones. Designing such a complex relation extraction system presents several practical challenges. For example, some quantities may be irrelevant to the question, while some others may need to be used multiple times. The model also needs to learn how to properly handle the new quantities that emerge from the intermediate expressions.
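For illustration (not part of the model), the deductive steps for the Figure 1 problem can be replayed with plain arithmetic, showing how the shared sub-expression (the divisor) is derived once and then reused:

```python
# Replaying the deductive steps for the Figure 1 problem:
# remainder = 8; divisor = 3 * remainder + 3; divisor = 6 * quotient;
# dividend = divisor * quotient + remainder.
steps = []

def apply_op(a, b, op):
    """One deductive step: relate two known quantities and record the step."""
    result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
    steps.append(f"{a} {op} {b} = {result}")
    return result

remainder = 8
e1 = apply_op(remainder, 3, "*")         # thrice the remainder -> 24
divisor = apply_op(e1, 3, "+")           # add 3 -> 27 (reused below, not re-derived)
quotient = apply_op(divisor, 6, "/")     # divisor is 6 times the quotient -> 4.5
e4 = apply_op(divisor, quotient, "*")    # divisor * quotient -> 121.5
dividend = apply_op(e4, remainder, "+")  # add the remainder -> 129.5
print(dividend)  # 129.5
```

Note how step 4 reuses the quantity `divisor` produced in step 2, whereas the tree-generation approach would regenerate the whole "8 × 3 + 3" sub-tree.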
Learning how to effectively search for the optimal sequence of operations (relations) and when to stop the deductive process is also important.
In this work, we tackle the above challenges and make the following major contributions:
• We formulate MWP solving as a complex relation extraction task, where we aim to repeatedly identify the basic relations between different quantities. To the best of our knowledge, this is the first effort that successfully tackles MWP solving from such a new perspective.
• Our model is able to automatically produce explainable steps that lead to the final answer, presenting a deductive reasoning process.
• Our experimental results on four standard datasets across two languages show that our model significantly outperforms existing strong baselines. We further show that the model performs better than previous approaches on problems with more complex equations.

Related Work
Early efforts focused on solving MWPs using probabilistic models with handcrafted features (Liguda and Pfeiffer, 2012). Kushman et al. (2014) and Roy and Roth (2018) designed templates to find the alignments between the declarative language and equations. Most recent works solve the problem with sequence or tree generation models. Wang et al. (2017) proposed the Math23k dataset and presented a sequence-to-sequence (seq2seq) approach to generate the mathematical expression; Chiang and Chen (2019) followed up in this seq2seq direction. Other approaches improve the seq2seq model with reinforcement learning (Huang et al., 2018), template-based methods, and a group attention mechanism (Li et al., 2019). Xie and Sun (2019) proposed a goal-driven tree-structured (GTS) model to generate the expression tree. This sequence-to-tree approach significantly improved the performance over the traditional seq2seq approaches. Some follow-up works incorporated external knowledge such as syntactic dependencies (Shen and Jin, 2020) or commonsense knowledge. Later work modeled the equations as a directed acyclic graph to obtain the expression. Other efforts, including Li et al. (2020), adopted a graph-to-tree approach to model the quantity relations using graph convolutional networks (GCN) (Kipf and Welling, 2017). Applying pre-trained language models such as BERT (Devlin et al., 2019) was shown to significantly benefit tree expression generation (Lan et al., 2021; Tan et al., 2021; Shen et al., 2021). Different from the tree-based generation models, our work is related to deductive systems (Shieber et al., 1995; Nederhof, 2003), where we aim to obtain step-by-step expressions. Recent efforts have also been working towards this direction. Ling et al. (2017) constructed a dataset that provides explanations for expressions at each step. Amini et al. (2019) created the MathQA dataset annotated with step-by-step operations; the annotations present the expression at each intermediate step during problem solving.
Our deductive process ( Figure 1) attempts to automatically obtain the expression in an incremental, step-by-step manner.
Our approach is also related to relation extraction (RE) (Zelenko et al., 2003), a fundamental task in the field of information extraction that is focused on identifying the relationships between a pair of entities. Recently, Zhong and Chen (2021) designed a simple and effective approach to directly model the relations on the span pair representations. In this work, we treat the operation between a pair of quantities as the relation at each step in our deductive reasoning process. Traditional methods (Liang et al., 2018) applied rule-based approaches to extract the mathematical relations.

[Figure 2: Our deductive system. t is the current step. ⟨⋅⟩ denotes the quantity list.]
MWP solving is typically regarded as one of the System 2 tasks (Kahneman, 2011; Bengio et al., 2021), and our current approach to this problem is related to neural symbolic reasoning (Besold et al., 2017). We design differentiable modules (Andreas et al., 2016; Gupta et al., 2020) in our model (§3.2) to perform reasoning among the quantities.

Approach
The math word problem solving task can be defined as follows. Given a problem description S = {w_1, w_2, ⋯, w_n} that consists of a list of n words, and Q_S = {q_1, q_2, ⋯, q_m}, a list of m quantities that appear in S, our task is to solve the problem and return the numerical answer. Ideally, the answer shall be computed through a mathematical reasoning process over a series of primitive mathematical operations (Amini et al., 2019), as shown in Figure 1. Such operations may include "+" (addition), "−" (subtraction), "×" (multiplication), "÷" (division), and "**" (exponentiation). In our view, each of the primitive mathematical operations above essentially describes a specific relation between quantities. Fundamentally, solving a math word problem is a complex relation extraction problem, which requires us to repeatedly identify the relations between quantities (including those appearing in the text and those intermediate ones created by relations). The overall solving procedure invokes a relation classification module at each step, yielding a deductive reasoning process.
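A minimal sketch of this setup; the regex-based quantity extraction is an assumption for illustration, not the paper's actual preprocessing:

```python
import re

# Each primitive operation defines a relation type between two quantities.
OPERATIONS = ["+", "-", "*", "/", "**"]

def extract_quantities(problem_text):
    """Pull the quantity list Q_S out of the problem description S.
    A naive regex is used here purely for illustration."""
    return [float(m.replace(",", ""))
            for m in re.findall(r"\d[\d,]*\.?\d*", problem_text)]

S = "If a machine can make 2,088 gears in 8 hours, how many gears can it make in 9 hours?"
print(extract_quantities(S))  # [2088.0, 8.0, 9.0]
```

Solving the problem then amounts to repeatedly classifying the relation (one of `OPERATIONS`) between pairs of these quantities.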
In practice, some questions cannot be answered without relying on certain predefined constants (such as π and 1) that may not have appeared in the given problem description. We therefore also consider a set of constants C = {c 1 , c 2 , ⋯, c |C| }. Such constants are also regarded as quantities (i.e., they would be regarded as {q m+1 , q m+2 , . . . , q m+|C| }) which may play useful roles when forming the final answer expression.

A Deductive System
As shown in Figure 1, applying a mathematical relation (e.g., "+") between two quantities yields an intermediate expression e. In general, at step t, the resulting expression e^(t) (after evaluation) becomes a newly created quantity that is added to the list of candidate quantities and is ready to participate in the remaining deductive reasoning process from step t + 1 onward. This process can be denoted as follows:
• Initialization: Q^(0) = ⟨q_1, q_2, ⋯, q_{m+|C|}⟩, the quantities from the text together with the constants.
• Deduction: at step t, e^(t)_{i,j,op} represents the expression after applying the relation op to the ordered pair (q_i, q_j); after evaluation, it is appended to the quantity list as a new quantity, so that Q^(t) = Q^(t−1) ⊕ ⟨e^(t)⟩.
Following the standard deduction systems (Shieber et al., 1995; Nederhof, 2003), the reasoning process can be formulated as in Figure 2. We start with an axiom with the list of quantities in Q^(0). The inference rule q_i --op--> q_j, as described above, yields the expression as a new quantity at step t.
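The deduction itself, given an oracle sequence of steps, can be sketched as a loop that grows the quantity list:

```python
def deduce(quantities, steps):
    """Run the deduction of Figure 2: each step picks (i, j, op) and appends
    the evaluated expression e^(t) as a new quantity, usable from step t+1 on."""
    Q = list(quantities)  # Q^(0): quantities from the text plus constants
    for i, j, op in steps:
        e = {"+": Q[i] + Q[j], "-": Q[i] - Q[j],
             "*": Q[i] * Q[j], "/": Q[i] / Q[j]}[op]
        Q.append(e)
    return Q

# "q1 / q2 * q3" from the Figure 3 example: (2088 / 8) * 9.
Q = deduce([2088.0, 8.0, 9.0], [(0, 1, "/"), (3, 2, "*")])
print(Q)  # [2088.0, 8.0, 9.0, 261.0, 2349.0]
```

Note how the second step refers to index 3, i.e., the quantity created by the first step.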

Model Components
Reasoner Figure 3 shows the deductive reasoning procedure in our model for an example that involves 3 quantities. We first convert the quantities (e.g., 2,088) into a general quantity token "<quant>". We next adopt a pre-trained language model such as BERT (Devlin et al., 2019) or Roberta (Liu et al., 2019; Cui et al., 2019) to obtain the quantity representation q for each quantity q.
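The quantity-masking step can be sketched as follows; the regex and the exact token handling are simplifications for illustration, since the real preprocessing is tokenizer-specific:

```python
import re

def mask_quantities(problem_text):
    """Replace each numeric quantity with the generic token "<quant>"
    before feeding the text to the pre-trained encoder."""
    return re.sub(r"\d[\d,]*\.?\d*", "<quant>", problem_text)

S = "If a machine can make 2,088 gears in 8 hours, how many gears can it make in 9 hours?"
print(mask_quantities(S))
```

The encoder's hidden state at each `<quant>` position then serves as that quantity's representation q.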
[Figure 3: If a machine can make 2,088 gears in 8 hours, how many gears can it make in 9 hours? We show the inference procedure to obtain the expression "q_1 ÷ q_2 × q_3" for this example question.]
Given the quantity representations, we consider all the possible quantity pairs (q_i, q_j). Similar to Lee et al. (2017), we obtain the representation of each pair by concatenating the two quantity representations and the element-wise product between them. As shown in Figure 3, we apply a non-linear feed-forward network (FFN) on top of the pair representation to get the representation of the newly created expression. The above procedure can be written as:

e_{i,j,op} = FFN_op([q_i; q_j; q_i ∘ q_j])

where e_{i,j,op} is the representation of the intermediate expression e, and op is the operation (e.g., "+", "−") applied to the ordered pair (q_i, q_j). FFN_op is an operation-specific network that gives the expression representation under the particular operation op. Note that we have the constraint i ≤ j; as a result, we also consider the "reverse operation" for division and subtraction (Roy and Roth, 2015). As shown in Figure 3, the expression e_{1,2,÷} will be regarded as a new quantity with representation q_4 at t = 1. In general, we can assign a score to a single reasoning step that yields the expression e^(t)_{i,j,op} from q_i and q_j with operation op. Such a score can be calculated by summing the scores defined over the representations of the two quantities and the score defined over the expression:

s(e^(t)_{i,j,op}) = s_q(q_i) + s_q(q_j) + s_e(e_{i,j,op})

where we have:

s_q(q) = w_q · q,    s_e(e) = w_e · e
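A shape-level sketch of this scoring step, using toy 2-dimensional quantity representations and fixed hypothetical weights (FFN_op is simplified here to a single linear layer with ReLU, not the paper's exact architecture):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ffn_op(pair_repr, W, b):
    """Operation-specific FFN, simplified to one linear layer + ReLU."""
    h = [dot(row, pair_repr) + bi for row, bi in zip(W, b)]
    return [max(0.0, v) for v in h]

def pair_representation(q_i, q_j):
    """[q_i; q_j; q_i o q_j]: concatenation plus element-wise product."""
    return q_i + q_j + [a * b for a, b in zip(q_i, q_j)]

def step_score(q_i, q_j, W, b, w_q, w_e):
    """s(e) = s_q(q_i) + s_q(q_j) + s_e(e_{i,j,op}), with s_q, s_e as dot products."""
    e = ffn_op(pair_representation(q_i, q_j), W, b)
    return dot(w_q, q_i) + dot(w_q, q_j) + dot(w_e, e)

# Toy 2-dim quantity representations; 6-dim pair representation -> 2-dim expression.
q1, q2 = [1.0, 0.0], [0.0, 1.0]
W, b = [[0.1] * 6, [0.2] * 6], [0.0, 0.0]  # hypothetical weights
score = step_score(q1, q2, W, b, w_q=[1.0, 1.0], w_e=[1.0, 1.0])
print(score)
```

In the model, one such score is computed for every candidate (q_i, q_j, op) triple, and the highest-scoring one is taken as the next deduction step.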

[Table 1: The mechanisms of the two rationalizers (multi-head self-attention and GRU).]

Here, s_q(⋅) and s_e(⋅) are the scores assigned to the quantity and the expression, respectively, and w_q and w_e are the corresponding learnable parameters.
Our goal is to find the optimal expression sequence [e (1) , e (2) , ⋯, e (T ) ] that enables us to compute the final numerical answer, where T is the total number of steps required for this deductive process.
Terminator Our model also has a mechanism that decides whether the deductive procedure is ready to terminate at any given time. We introduce a binary label τ, where 1 means the procedure stops here, and 0 otherwise. The final score of the expression e at time step t can be calculated as:

s(e^(t), τ) = s(e^(t)_{i,j,op}) + w_τ · e^(t)_{i,j,op}

where w_τ is the parameter vector for scoring the termination label τ.
Rationalizer Once we obtain a new intermediate expression at step t, it is crucial to update the representations of the existing quantities. We call this step rationalization because it can potentially give us the rationale that explains an outcome (Lei et al., 2016). As shown in Figure 4, the intermediate expression e serves as the rationale that explains how the quantity changes from q to q′. Without this step, the model has a potential shortcoming: if the quantity representations are not updated as the deductive reasoning process continues, the expressions that were initially highly ranked (say, at the first step) would always be preferred over the lowly ranked ones throughout the process. We rationalize the quantity representation using the current intermediate expression e^(t), so that the quantity is aware of the generated expressions when its representation gets updated. This procedure can be formulated as follows:

q′_i = Rationalizer(q_i, e^(t))   for all i

Two well-known techniques we can adopt as rationalizers are multi-head self-attention (Vaswani et al., 2017) and a gated recurrent unit (GRU) (Cho et al., 2014) cell, which allow us to update the quantity representation given the intermediate expression representation. Table 1 shows the mechanisms of the two rationalizers. For the first approach, we essentially construct a sentence with two token representations, the quantity q_i and the previous expression e, and perform self-attention. In the second approach, we use q_i as the input state and e as the previous hidden state in a GRU cell.
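A minimal scalar GRU-cell rationalizer can be sketched as follows; the weights are toy, hypothetical values, and a real implementation would use a vector-valued GRU cell from a deep-learning library:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_rationalize(q, e, w):
    """Minimal scalar GRU cell: input = quantity q, previous hidden state =
    expression e. Gates follow the standard GRU equations; w holds the
    (toy, hypothetical) weights."""
    z = sigmoid(w["Wz"] * q + w["Uz"] * e)          # update gate
    r = sigmoid(w["Wr"] * q + w["Ur"] * e)          # reset gate
    h = math.tanh(w["Wh"] * q + w["Uh"] * (r * e))  # candidate state
    return (1 - z) * e + z * h                      # rationalized representation q'

w = {"Wz": 0.5, "Uz": 0.5, "Wr": 0.5, "Ur": 0.5, "Wh": 1.0, "Uh": 1.0}
q_new = gru_rationalize(q=1.0, e=0.2, w=w)
print(q_new)  # the quantity representation, now aware of the new expression
```

After each deduction step, this update is applied to every quantity in the list, so that later scoring decisions see the new context.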

Training and Inference
Similar to training sequence-to-sequence models (Luong et al., 2015), we adopt the teacher-forcing strategy (Williams and Zipser, 1989) to guide the model with gold expressions during training. The loss can be written as:

L(θ) = Σ_{t=1}^{T} [ max_{e,τ ∈ H^(t)} s(e, τ) − s(e*^(t), τ*^(t)) ] + λ‖θ‖²

where θ includes all parameters in the deductive reasoner, e*^(t) and τ*^(t) are the gold expression and termination label at step t, and H^(t) contains all the possible choices of quantity pairs and relations available at time step t. λ is the hyperparameter for the L2 regularization term. The set H^(t) grows as new expressions are constructed and become new quantities during the deductive reasoning process. The overall loss is computed by summing the loss at each time step (assuming T steps in total). (One might notice that this loss admits a trivial solution at θ = 0. In practice, however, our model and training process would prevent us from reaching such a degenerate solution with proper initialization (Goodfellow et al., 2016). This is similar to the training of a structured perceptron (Collins, 2002), where a similar situation is also involved.)

During inference, we set a maximum time step T_max and find the best expression e* that has the highest score at each time step. Once τ = 1 is chosen, we stop constructing new expressions and terminate the process. The overall expression (formed by the resulting expression sequence) is used for computing the final numerical answer.

Declarative Constraints Our model repeatedly relies on existing quantities to construct new quantities, which results in a structure showing the deductive reasoning process. One advantage of such an approach is that it allows certain declarative knowledge to be conveniently incorporated. For example, as we can see in Equation 6, the default approach considers all the possible combinations among the quantities during the maximization step. We can easily impose constraints to avoid considering certain combinations.
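As a concrete sketch of such constraint filtering, a candidate enumeration can simply skip excluded combinations; the two constraints used here (no operation over the same quantity twice, and no negative intermediate result) are the SVAMP-style ones discussed next, and the enumeration itself is illustrative:

```python
def enumerate_candidates(Q):
    """Enumerate (i, j, op) choices over the quantity list Q, excluding
    combinations ruled out by the declarative constraints."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    out = []
    for i in range(len(Q)):
        for j in range(len(Q)):
            if i == j:
                continue  # constraint: never relate a quantity to itself
            for name, f in ops.items():
                if name == "/" and Q[j] == 0:
                    continue  # undefined result, skip
                if f(Q[i], Q[j]) >= 0:  # constraint: no negative intermediates
                    out.append((i, j, name))
    return out

cands = enumerate_candidates([9.0, 4.0])
print((0, 0, "+") in cands, (1, 0, "-") in cands)  # False False: both excluded
```

The maximization in training and inference then runs over this reduced candidate set instead of the full one.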
In practice, we found that in certain datasets such as SVAMP, there does not exist any expression that involves operations applied to the same quantity (such as 9 + 9 or 9 × 9, where 9 is the same quantity in the text). Besides, we also observe that the intermediate results would not be negative. We can simply exclude such cases in the maximization process, effectively reducing the search space during both training and inference. We show that adding such declarative constraints can help improve the performance.

Experiments

Datasets We evaluate on four benchmark datasets: MAWPS, Math23k, MathQA, and SVAMP (Patel et al., 2021). The dataset statistics can be found in Table 2.

[Table 3: Value accuracy on MAWPS.]
S2S: GroupAttn (Li et al., 2019): 76.1; Transformer (Vaswani et al., 2017): 85.6; BERT-BERT (Lan et al., 2021): 86.9; Roberta-Roberta (Lan et al., 2021): 88.4
S2T/G2T: GTS (Xie and Sun, 2019): 82.6; Graph2Tree: 85.6; Roberta-GTS (Patel et al., 2021): 88.5; Roberta-Graph2Tree (Patel et al., 2021): 88.7
Ours: BERT-DEDUCTREASONER: 91.2 (± 0.16); ROBERTA-DEDUCTREASONER: 92.0 (± 0.20); MBERT-DEDUCTREASONER: 91.6 (± 0.13); XLM-R-DEDUCTREASONER: 91.6 (± 0.11)

We adapt the data to filter out some questions that are unsolvable. We consider the operations "addition", "subtraction", "multiplication", and "division" for MAWPS and SVAMP, and an extra "exponentiation" for MathQA and Math23k. The number of operations involved in each question can serve as one indicator of the difficulty of a dataset. Figure 5 shows the percentage distribution of the number of operations involved in each question. The MathQA dataset generally contains larger portions of questions that involve more operations, while 97% of the questions in MAWPS can be answered with only one or two operations. More than 60% of the instances in MathQA have three or more operations, which likely makes their problems harder to solve. Furthermore, MathQA (Amini et al., 2019) contains GRE questions in many domains including physics, geometry, and probability, while Math23k questions are from primary school. Different from the other datasets, SVAMP (Patel et al., 2021) is a challenging set that is manually created to evaluate a model's robustness. The authors applied variations over instances sampled from MAWPS; such variations include adding extra quantities and swapping the positions of noun phrases.
Training Details We adopt BERT (Devlin et al., 2019) and Roberta for the English datasets. Chinese BERT and Chinese Roberta (Cui et al., 2019) are used for Math23k. We use the GRU cell as the rationalizer. We also conduct experiments with multilingual BERT and XLM-Roberta (Conneau et al., 2020). The pre-trained models are initialized from HuggingFace's Transformers (Wolf et al., 2020). We optimize the loss with the Adam optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2019). We use a learning rate of 2e-5 and a batch size of 30. The regularization coefficient λ is set to 0.01. We run our models with 5 random seeds and report the average results (with standard deviation). Following most previous works (Tan et al., 2021; Patel et al., 2021), we mainly report the value accuracy (percentage) in our experiments; that is, a prediction is considered correct if the predicted expression evaluates to the same value as the gold expression.

MAWPS and Math23k
We first discuss the results on MAWPS and Math23k, two datasets that are commonly used in previous research. Tables 3 and 4 show the main results of the proposed models with different pre-trained language models. We compare with previous works that have reported results on these datasets. Among all the encoders for our model DEDUCTREASONER, the Roberta encoder achieves the best performance. In addition, DEDUCTREASONER significantly outperforms all the baselines regardless of the choice of encoder. The performance of the best S2S model (Roberta-Roberta) is on par with the best S2T model (Roberta-Graph2Tree) on MAWPS. Overall, the accuracy of Roberta-based DEDUCTREASONER is more than 3 points higher than Roberta-Graph2Tree (p < 0.001) on MAWPS, and more than 2 points higher than BERT-Tree (p < 0.005) on Math23k. These comparisons show that our deductive reasoner is robust across different languages and datasets of different sizes.
MathQA and SVAMP As mentioned before, MathQA and SVAMP are more challenging: the former consists of more complex questions, and the latter consists of specifically designed challenging questions. Tables 5 and 6 show the performance comparisons. We are able to outperform the best baseline, mBERT-LSTM, by 1.5 points in accuracy on MathQA. Different from the other three datasets, the performance gaps between different language models are larger on SVAMP.

[Table 6: Value accuracy on SVAMP.]
S2S: GroupAttn (Li et al., 2019): 21.5; BERT-BERT (Lan et al., 2021): 24.8; Roberta-Roberta (Lan et al., 2021): 30.3
S2T/G2T: GTS* (Xie and Sun, 2019): 30.8; Graph2Tree: 36.5; BERT-Tree: 32.4; Roberta-GTS (Patel et al., 2021): 41.0; Roberta-Graph2Tree (Patel et al., 2021): 43

As we can see from the baselines and our models, the choice of encoder appears to be important for solving questions in SVAMP; the results with Roberta as the encoder are particularly striking. Our best variant, ROBERTA-DEDUCTREASONER, achieves an accuracy score of 47.3 and is able to outperform the best baseline (Roberta-Graph2Tree) by 3.5 points (p < 0.01). By incorporating the constraints from our prior knowledge (as discussed in §3.3), we observe significant improvements for all variants, up to 7.0 points for our BERT-DEDUCTREASONER. Overall, these results show that our model is more robust than previous approaches on such challenging datasets.

Fine-grained Analysis
We further perform fine-grained performance analysis based on questions with different numbers of operations. Table 7 shows the accuracy scores for questions that involve different numbers of operations, as well as the equation accuracy on all datasets. We compare our ROBERTA-DEDUCTREASONER with the best performing baselines in Table 3 (Roberta-Graph2Tree), Table 4 (BERT-Tree), Table 5 (mBERT+LSTM), and Table 6 (Roberta-Graph2Tree). On MAWPS and Math23k, our ROBERTA-DEDUCTREASONER model consistently yields higher results than the baselines. On MathQA, our model also performs better on questions that involve 2, 3, and 4 operations. On the other more challenging dataset, SVAMP, our model has comparable performance with the baseline on 1-step questions, but achieves significantly better results (+14.3 points) on questions that involve 2 steps. Such comparisons on MathQA and SVAMP show that our model has robust reasoning capability on more complex questions. We observe that all models (including ours and existing ones) achieve much lower accuracy scores on SVAMP than on the other datasets, and we further investigate the reason. Patel et al. (2021) added irrelevant information, such as extra quantities, to the questions to confuse models. We quantify this effect by counting the percentage of instances that have quantities unused in the equations. As we can see in Table 8, SVAMP has the largest proportion (44.5%) of instances whose gold equations do not fully utilize all the quantities in the problem text. The performance also significantly drops on questions with more than one unused quantity, on all datasets. This analysis suggests that our model still suffers from extra irrelevant information in the question, and the performance is severely affected when such irrelevant information appears more frequently.
Effect of Rationalizer Table 9 shows the performance comparison with different rationalizers. As described in §3.2, the rationalizer is used to update the quantity representations at each step, so as to better "prepare them" for the subsequent reasoning process given the new context. We believe this step is crucial for achieving good performance, especially for complex MWP solving. As shown in Table 9, the performance drops by 7.3 points in value accuracy on Math23k without rationalization, confirming the importance of rationalization in solving more complex problems that involve more steps. As most of the questions in MAWPS involve only one step, the significance of the rationalizer is not fully revealed on this dataset.

[Table 9: Performance comparison of different rationalizers using the Roberta-base model.]

[Figure 6 question: Xiaoli and Xiaoqiang typed a manuscript together. Their typing speed ratio was 5:3. Xiaoli typed 1,400 more words than Xiaoqiang. How many words are there in this manuscript? Gold Expr: (not recovered).]
It can be seen that using self-attention achieves worse performance than the GRU unit. We believe the lower performance by using multi-head attention as rationalizer may be attributed to two reasons. First, GRU comes with sophisticated internal gating mechanisms, which may allow richer representations for the quantities. Second, attention, often interpreted as a mechanism for measuring similarities (Katharopoulos et al., 2020), may be inherently biased when being used for updating quantity representations. This is because when measuring the similarity between quantities and a specific expression (Figure 4), those quantities that have just participated in the construction of the expression may receive a higher degree of similarity.

Case Studies
Explainability of Output Figure 6 presents an example prediction from Math23k. In this question, the gold deductive process first obtains the speed difference via "5 ÷ (5 + 3) − 3 ÷ (5 + 3)", and the final answer is 1400 divided by this difference. The predicted deductive process, on the other hand, offers a slightly different understanding of the speed difference. Assuming speed can be measured in some abstract "units", the predicted process first performs subtraction between 5 and 3, which gives "2 units" of speed difference. Next, we obtain the number of words associated with each speed unit (1400 ÷ 2). Finally, we arrive at the total number of words by multiplying the number of words per unit (700) by the total number of units (8). Through such an example, we can see that our deductive reasoner is able to produce explainable steps that help us understand the answers.
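The two deductive processes for this example can be checked with plain arithmetic; both reach the same final answer:

```python
# Replaying the predicted deduction for the Figure 6 example.
unit_diff = 5 - 3                      # step 1: "2 units" of speed difference
words_per_unit = 1400 // unit_diff     # step 2: words per speed unit -> 700
total_units = 5 + 3                    # step 3: total number of speed units -> 8
answer = words_per_unit * total_units  # step 4: total words -> 5600

# The gold deduction reaches the same value: 1400 / (5/8 - 3/8).
gold = 1400 / (5 / (5 + 3) - 3 / (5 + 3))
print(answer, gold)  # 5600 5600.0
```

The predicted step sequence is shorter and arguably more natural to explain, even though both derivations are correct.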

Question Perturbation
The model predictions also give us guidance for understanding errors. Figure 7 shows how we can perturb a question given an erroneous prediction (taken from Math23k). As we can see, the first step incorrectly predicts the "+" relation between 255 and 35. Because the first step involves the two quantities in the first two sentences, we can locate the possible cause of the error there. The gold step has a probability of 0.062, somewhat lower than the incorrect prediction. We believe that the second sentence (marked in red) may convey semantics that are challenging for the model to digest, resulting in the incorrect prediction. We therefore perturb the second sentence to make it semantically more straightforward (marked below in blue). The probability of the sub-expression 255 − 35 becomes higher after the perturbation, leading to a correct prediction (the "−" relation). Such an analysis demonstrates the strong interpretability of our deductive reasoner, and highlights the important connection between math word problem solving and reading comprehension, a topic that has been studied in educational psychology (Vilenius-Tuohimaa et al., 2008).

Practical Issues
We discuss some practical issues with the current model in this section. Similar to most previous research efforts (Li et al., 2019; Xie and Sun, 2019), our work needs to maintain a list of constants (e.g., 1 and π) as additional candidate quantities. However, a large number of quantities could lead to a large search space of expressions (i.e., H). In practice, we could select some top-scoring quantities and build expressions on top of them (Lee et al., 2018). Another assumption of our model, as shown in Figure 3, is that only binary operators are considered. Extending the model to support unary or ternary operators would be straightforward: handling unary operators would require introducing some unary rules, and a ternary operator can be defined as a composition of two binary operators. Our current model performs greedy search during training and inference, which could be improved with a beam search process. One challenge in designing the beam search algorithm is that the search space H^(t) expands at each step t (Equation 6); we empirically found that the model tends to favor outputs that involve fewer reasoning steps. In fact, better understanding the behavior and effect of beam search in seq2seq models remains an active research topic (Cohen and Beck, 2019; Koehn and Knowles, 2017; Hokamp and Liu, 2017), and we believe that performing effective beam search in our setup is an interesting research question worth exploring further.
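The greedy inference loop can be sketched as follows; the scorer here is a hypothetical stand-in for the trained model, and the candidate set H^(t) grows as new quantities are appended:

```python
def greedy_decode(quantities, score_fn, t_max=8):
    """Greedy inference: at each step pick the highest-scoring (i, j, op, tau)
    candidate, append the new quantity, and stop when tau = 1 or t_max is hit."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    Q = list(quantities)
    history = []
    for _ in range(t_max):
        cands = [(i, j, op, tau)
                 for i in range(len(Q)) for j in range(len(Q)) if i != j
                 for op in ops for tau in (0, 1)
                 if not (op == "/" and Q[j] == 0)]
        i, j, op, tau = max(cands, key=score_fn)  # H^(t) grows with Q
        Q.append(ops[op](Q[i], Q[j]))
        history.append((i, j, op))
        if tau == 1:
            break
    return Q[-1], history

# Toy scorer favoring "/" then "*" for the Figure 3 example (hypothetical values).
def toy_score(cand):
    return {(0, 1, "/", 0): 3.0, (3, 2, "*", 1): 4.0}.get(cand, 0.0)

answer, steps = greedy_decode([2088.0, 8.0, 9.0], toy_score)
print(answer, steps)
```

A beam-search variant would keep the top-k partial derivations per step instead of a single one, at the cost of the bookkeeping issues discussed above.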

Conclusion and Future Work
We provide a new perspective on the task of MWP solving and argue that it can fundamentally be regarded as a complex relation extraction problem. Based on this observation, and motivated by the deductive reasoning process, we propose an end-to-end deductive reasoner that obtains the answer expression in a step-by-step manner. At each step, our model performs iterative mathematical relation extraction between quantities. Thorough experiments on four standard datasets demonstrate that our deductive reasoner is robust and able to yield new state-of-the-art performance. The model achieves particularly strong performance on complex questions that involve a larger number of operations, and it offers flexibility in interpreting the results thanks to its deductive nature. Future directions we would like to explore include how to effectively incorporate commonsense knowledge into the deductive reasoning process, and how to facilitate counterfactual reasoning (Richards and Sanderson, 1999).