Sequence to General Tree: Knowledge-Guided Geometry Word Problem Solving

With the recent advancements in deep learning, neural solvers have achieved promising results in solving math word problems. However, these SOTA solvers only generate binary expression trees that contain basic arithmetic operators and do not explicitly use math formulas. As a result, the expression trees they produce are lengthy and uninterpretable because they need multiple operators and constants to represent a single formula. In this paper, we propose sequence-to-general tree (S2G), which learns to generate interpretable and executable operation trees where the nodes can be formulas with an arbitrary number of arguments. With nodes now allowed to be formulas, S2G can learn to incorporate mathematical domain knowledge into problem solving, making the results more interpretable. Experiments show that S2G achieves better performance than strong baselines on problems that require domain knowledge.


Introduction
Math word problem (MWP) solving is a special subfield of question answering. It requires machine solvers to read the problem text, understand it, and then compose the numbers and operators into a meaningful equation (as shown in Table 1). This process, even for the simplest problem in elementary school, demands language understanding and numerical reasoning capabilities, making this task a long-standing challenge in AI (Bobrow, 1964). As with any QA task, solving an MWP requires the introduction of external knowledge or domain knowledge (Mishra et al., 2020). However, current state-of-the-art solvers (Xie and Sun, 2019) do not address this issue explicitly. They learn to map the problem text into binary expression trees regardless of whether it requires any knowledge. This is counterintuitive for problems that need math concepts or formulas. As illustrated in Figure 1(a), without explicitly using the corresponding area formula, the expression tree for the problem is lengthy and uninterpretable.
Problem: The outer radius and the inner radius of a circular annulus are 5m and 3m respectively. Find the area of this circular annulus.
Equation: x = 5 * 5 * 3.14 - 3 * 3 * 3.14
Answer: 50.24
With formula: x = circle area(5) - circle area(3)
Table 1.
To address this issue, we propose a sequence-to-general tree (S2G) architecture where the nodes can be arbitrary math concepts or formulas with an arbitrary number of arguments. In this way, our S2G model can learn to map the problem text into executable operation trees that contain different formulas across different domains. For example, S2G can learn to generate tree nodes that contain the required geometry formula for circles, as shown in Figure 1(b), making the result more intuitive and explainable.
In addition, we propose a knowledge-guided mechanism to guide tree decoding using a mathematical knowledge graph (KG). To evaluate our model, we also construct a medium-sized dataset consisting of 1,398 geometry word problems that require a diverse set of formulas. Experimental results show that our S2G model achieves better performance and more interpretable results than strong baselines on problems that require domain knowledge.
The main contributions of this paper are:
1. We propose a seq-to-general tree model that learns to map the problem text into operation trees where the nodes can be formulas with an arbitrary number of arguments. This helps to incorporate domain knowledge into problem solving and produce interpretable results.
2. We design a knowledge-guided mechanism that guides tree decoding using mathematical knowledge graphs and GNNs.
3. We curate a medium-sized dataset that contains 1,398 geometry word problems. In addition, we annotate them with detailed formulas that can be readily converted into operation trees.
2 Seq2seq vs. Seq2tree vs. Seq2general
Our goal is to design a sequence-to-general tree model that learns to map the problem text into its corresponding operation tree. Before diving into the model, we first compare the decoding mechanisms of seq-to-seq, seq-to-tree and our seq-to-general tree solvers. Figure 2 illustrates the decoding process of these three types of model. For seq2seq models, the decoder basically does two things: (1) predicting the current output and (2) generating the next state. These two steps can be conditioned on different information, including the current state, the current input, or a context vector calculated using attention. The decoder repeats these two steps until it outputs an end token. For seq2tree models, however, this process is slightly different. The decoder predicts the current output as in seq2seq, but it decides whether to generate the next state based on the current output. If the current output is an arithmetic operator, the decoder knows it should produce two child states, which are expanded into the left and right children. If the current output is a number, the decoder stops expanding this node, so the current node becomes a leaf node. As a result, the whole decoding process resembles generating an expression tree in a top-down manner.
In our work, we generalize the decoding process by making the decoder produce a variable number of children based on the type of the current output. If the output is a number or operator, the decoder would produce zero or two child states as before. If the output is a formula, the decoder will generate the pre-specified number of child states for this formula.
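The rule for choosing the number of children can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the formula names and argument counts in `FORMULA_ARITY` are assumptions chosen for the example, and `is_complete` is a hypothetical helper that checks whether a prefix-ordered output forms exactly one finished tree.

```python
# Hypothetical arity table: the formula names and argument counts below are
# illustrative assumptions, not the paper's full formula set.
OPERATORS = {"+", "-", "*", "/"}
FORMULA_ARITY = {
    "circle_area": 1,      # radius
    "rectangle_area": 2,   # length, width
    "cuboid_volume": 3,    # length, width, height
}


def num_children(token: str) -> int:
    """Return how many child states the decoder should create for `token`."""
    if token in OPERATORS:
        return 2                     # binary arithmetic operator
    if token in FORMULA_ARITY:
        return FORMULA_ARITY[token]  # pre-specified number of arguments
    return 0                         # number or constant: leaf node


def is_complete(prefix_tokens: list[str]) -> bool:
    """Check that a prefix-ordered token list forms exactly one complete tree."""
    open_slots = 1                   # the root still has to be filled
    for token in prefix_tokens:
        if open_slots == 0:
            return False             # tokens left over after the tree closed
        open_slots += num_children(token) - 1
    return open_slots == 0
```

Because every token either closes one open slot or opens new ones, a decoder that follows this rule can only stop when the tree is structurally valid.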

3 Sequence-to-General Tree Model
In this section, we give a detailed description of each part of our S2G model.

Encoder
The main function of the encoder is to encode the problem text $P = (x_1, x_2, ..., x_n)$ into a sequence of hidden states $(h_1, h_2, ..., h_n)$ and their summary state $h_{encoder}$. The hidden states $h_1$ to $h_n$ are expected to contain the information for each input token $x_1$ to $x_n$, while the summary state $h_{encoder}$ is expected to capture the overall information of the problem.
Specifically, we use bidirectional gated recurrent units (GRU) (Cho et al., 2014) as our encoder. Given the current input $x_t$, the previous state $h_{t-1}$, and the next state $h_{t+1}$, the current state $h_t \in (h_1, h_2, ..., h_n)$ is computed from a forward pass $\overrightarrow{h_t} = \mathrm{GRU}(x_t, \overrightarrow{h_{t-1}})$ and a backward pass $\overleftarrow{h_t} = \mathrm{GRU}(x_t, \overleftarrow{h_{t+1}})$, where the arrows represent different directions in the bidirectional encoding. After calculating the hidden state for each input token, we combine the last states of the forward and backward directions to get the summary state $h_{encoder}$ for the encoder.

Geometry Knowledge Graph
To incorporate domain knowledge into problem solving, we propose to utilize the knowledge from mathematical knowledge graphs. The main idea is that given a formula predicted as the current node, we can use the physical meaning of its arguments to help us better predict its children. For example, if the current node is the formula for rectangle area, then we know its child nodes should be related to "length" and "width". We can thus use the node embeddings of "length" and "width" from a geometry KG to provide additional information for our solver. We manually collect a geometry knowledge graph which contains the common geometric shapes (e.g., square, circle) and their geometric quantities (e.g., area, length), and we link these nodes to each other if they belong to the same shape. To embed this KG, we employ a graph convolutional network (GCN) (Kipf and Welling, 2017) that maps the KG into a vector space and calculates the embedding of each node. Given the feature matrix $X$ and the adjacency matrix $A$ of the KG, we use a two-layer GCN to encode it into $Z = (z_1, ..., z_n)$, the node embeddings for each node in the graph. Then, we can use these embeddings to represent the physical meaning of a certain formula argument in the decoding process.
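As a rough illustration of the graph encoding step, the sketch below applies the standard two-layer propagation rule of Kipf and Welling, $Z = \hat{A}\,\mathrm{ReLU}(\hat{A} X W_0)\,W_1$ with $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$, to a toy knowledge graph. The exact equation used by the authors is not shown here, so this form, the toy node set, the one-hot features, and the random untrained weights are all assumptions. A real implementation would more likely use a library layer such as PyTorch Geometric's `GCNConv`.

```python
import math
import random

# Toy geometry KG (an illustrative assumption, not the authors' graph).
NODES = ["rectangle", "length", "width", "circle", "radius", "area"]
SHAPES = {
    "rectangle": ["rectangle", "length", "width", "area"],
    "circle": ["circle", "radius", "area"],
}


def build_adjacency(nodes, shapes):
    """Link every pair of nodes that belong to the same shape."""
    index = {name: i for i, name in enumerate(nodes)}
    adj = [[0.0] * len(nodes) for _ in nodes]
    for members in shapes.values():
        for a in members:
            for b in members:
                if a != b:
                    adj[index[a]][index[b]] = 1.0
    return adj


def normalize(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2}, the usual GCN normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_two_layer(x, adj, w0, w1):
    """Z = A_norm * ReLU(A_norm * X * W0) * W1 (no output non-linearity)."""
    a_norm = normalize(adj)
    hidden = relu(matmul(a_norm, matmul(x, w0)))
    return matmul(a_norm, matmul(hidden, w1))


rng = random.Random(0)
n = len(NODES)
X = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # one-hot features
W0 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]     # untrained weights
W1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
Z = gcn_two_layer(X, build_adjacency(NODES, SHAPES), W0, W1)        # one row per node
```

In this toy graph "length" and "width" have identical neighbourhoods, so they receive identical embeddings, while "area" is shared by both shapes and ends up with a different one.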

General Tree Decoder
In the decoding stage, the decoder learns to produce the target operation trees in a recursive manner. It first predicts the current output $y_t$ in order to determine the number of children of the current node. The probability of the current output $P(y_t)$ (Equation 7) is calculated from the current decoder state $s_t$, the embedding of the last output $e(y_{t-1})$, and the node embedding $z_t$, which represents the physical meaning in the knowledge graph. Two context vectors over the encoder states $h_1^n = (h_1, ..., h_n)$ enter this calculation: $c_t$, the context vector of $e(y_{t-1})$ with respect to $h_1^n$, and a second context vector calculated using the node embedding $z_t$ and $h_1^n$. Specifically, we use additive attention (Bahdanau et al., 2015) to calculate these context vectors and use $h_{encoder}$ as the first decoder state $s_0$. Given the probability $P(y_t)$, we can then determine the output token $\hat{y}_t$.
Next, we predict the child states conditioned on the required number of children for $\hat{y}_t$. Unlike previous binary-tree decoders that use two distinct DNNs to predict the left and right children respectively (Xie and Sun, 2019), we employ a GRU to predict a variable number of children. Given the current state $s_t$, its child states $s_{t_1}, ..., s_{t_n}$ are generated in a recurrent manner: we generate the first child $s_{t_1}$ using $s_t$, and each following child state $s_{t_i}$ using its previous sibling $s_{t_{i-1}}$, until we reach the required number of children. The decoder is basically a GRU followed by a linear projection layer and an activation function, where the input of the GRU is the concatenation of $e(y_t)$ and $c_t$, $W_s$ is the linear projection layer, and ReLU is used as the activation function. After getting these child states, we push them into a stack and repeat the steps from Equation (5) to Equation (11) until all the states are realized into tokens.
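The control flow of this stack-based decoding can be sketched without any neural components. In the sketch below, `predict` and `step` are hypothetical stand-ins for the output distribution and the GRU-based child-state generator, the `ARITY` table is an assumed example, and states are plain strings so that the sibling chaining is visible. The attention context and the embedding inputs described above are deliberately omitted.

```python
# Hypothetical child counts: operators take two children, formulas a
# pre-specified number, and numbers (anything not listed) none.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "circle_area": 1}


def expand_children(parent_state, n_children, step):
    """Generate child states recurrently.

    The first child is derived from the parent state, and every later child
    is derived from its previous sibling.
    """
    children, previous = [], parent_state
    for _ in range(n_children):
        previous = step(previous)
        children.append(previous)
    return children


def decode(root_state, predict, step, max_nodes=64):
    """Top-down decoding with an explicit stack; returns tokens in prefix order."""
    stack, outputs = [root_state], []
    while stack and len(outputs) < max_nodes:
        state = stack.pop()
        token = predict(state)            # stand-in for picking the output token
        outputs.append(token)
        children = expand_children(state, ARITY.get(token, 0), step)
        stack.extend(reversed(children))  # leftmost child is expanded first
    return outputs


# Scripted predictor that replays a gold prefix sequence, for illustration only.
script = iter(["-", "circle_area", "5", "circle_area", "3"])
tokens = decode("s0", predict=lambda state: next(script), step=lambda s: f"f({s})")
```

With the string states used here, the two children of the root are `f(s0)` and `f(f(s0))`, which mirrors how the second child is computed from the first sibling rather than directly from the parent.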

Training Objective
For a problem and operation tree pair (P, T), we follow previous seq2tree work (Xie and Sun, 2019) and set our objective to minimize the negative log-likelihood of the target tree T given the problem P.
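A minimal sketch of this objective, assuming it decomposes over tree nodes in the usual way, i.e. the loss for one pair (P, T) is the sum of $-\log P(y_t)$ over the gold tokens $y_t$ of T. The function name `tree_nll` and the example probabilities are hypothetical.

```python
import math


def tree_nll(gold_token_probs):
    """Negative log-likelihood of one target tree.

    `gold_token_probs` holds the model probability P(y_t) assigned to the gold
    token at each node of the tree, in decoding order.
    """
    return -sum(math.log(p) for p in gold_token_probs)


# Example: probabilities for the five nodes of [-, circle_area, 5, circle_area, 3].
confident = tree_nll([0.9, 0.8, 0.95, 0.8, 0.9])
uncertain = tree_nll([0.5, 0.4, 0.6, 0.4, 0.5])
```

A shorter operation tree has fewer terms in this sum than the equivalent binary expression tree, which is one way to read the later claim that operation trees are easier to learn.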

Dataset
To evaluate our S2G model on problems that require formulas, we curate a medium-sized dataset, GeometryQA, that contains 1,398 geometry word problems. These problems are collected from Math23K (Wang et al., 2017) using the keywords of common geometric objects (e.g., circle, square, etc.) and their shapes (e.g., rectangular, circular, etc.). Then, we re-annotate each problem with its associated formulas if the problem belongs to one of the six major shapes: square, cube, rectangle, cuboid, triangle and circle. Table 2 shows the overall statistics of GeometryQA and Table 7 in Appendix B shows the 11 formulas we used to annotate these problems.
Note that not all problems in GeometryQA are annotated with formulas. About 16% of the problems belong to other shapes (e.g., parallelogram, rhombus, etc.) which currently are not covered in our formula set. About 40% are problems that contain geometric keywords but do not actually require any formulas. Table 3 shows such an example. We use these problems to test the robustness of our model. That is, S2G has to learn to apply the correct formulas or equations despite misleading keywords (as shown in Table 3) and has to learn to generate both binary expression trees and operation trees as a whole.
Problem: The perimeter of a rectangular swimming pool is 300 m. If you place a chair every 10 m all the way around its perimeter, how many chairs do you need?
Equation: x = 300/10
Answer: 30
Table 3: Example problem that contains misleading keywords (perimeter, rectangular) but does not require any geometry formulas.

Implementation Details
We implement our S2G model and the GNN module using PyTorch and PyTorch Geometric. We set the dimension of the word embedding to 128 and the dimension of the hidden state of the GRU and GNN to 512. The dropout rate (Srivastava et al., 2014) is set to 0.5 and the batch size is 64. For optimization, we use Adam (Kingma and Ba, 2015) with a learning rate of $10^{-3}$ and a weight decay of $10^{-5}$. In addition, we use a learning rate scheduler that halves the learning rate every 20 epochs. During evaluation, we use beam search (Wiseman and Rush, 2016) with a beam size of 5.
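The hyperparameters above can be collected in one place, together with the step schedule they imply. This is a plain-Python sketch under the assumption that "every 20 epochs" means the rate is multiplied by 0.5 at epochs 20, 40, and so on; the dictionary keys and the `learning_rate` helper are hypothetical names. In PyTorch, the same schedule is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)`.

```python
# Values taken from the text; the key names are illustrative.
CONFIG = {
    "embedding_dim": 128,
    "hidden_dim": 512,
    "dropout": 0.5,
    "batch_size": 64,
    "learning_rate": 1e-3,
    "weight_decay": 1e-5,
    "lr_step_size": 20,   # epochs between decays
    "lr_gamma": 0.5,      # multiplicative decay factor
    "beam_size": 5,
}


def learning_rate(epoch, base_lr=CONFIG["learning_rate"],
                  step_size=CONFIG["lr_step_size"], gamma=CONFIG["lr_gamma"]):
    """Learning rate in effect during `epoch` (0-indexed) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [learning_rate(e) for e in (0, 19, 20, 40)]
```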

Experimental Results on GeometryQA
We evaluate our S2G model on GeometryQA to check whether it can learn to predict the corresponding operation tree for the geometry word problems. Table 4 shows the results of our S2G against other seq2tree SOTA models. S2G is trained using the re-annotated equations that contain formulas, while the baselines are trained using the original equations. First, we find that S2G has about a 3.8% performance gain over its baselines (with p-value < 0.01). We attribute this to the fact that operation trees are easier to learn and generate, since they are shorter and less complex than binary expression trees. Hence, there is a better chance for S2G to produce the correct trees and arrive at the correct answers. Second, there is a small performance gain from adding the geometry KG. However, the improvement is not significant (with p-value ≈ 0.8). We conjecture that this is because the dataset currently has only six geometric objects, which is not complex enough to show the effectiveness of adding knowledge graphs.

Table 4 (columns: Model, Accuracy)

Conclusion
In this work, we proposed a sequence-to-general tree model (S2G) that aims to generalize previous seq2tree architectures. Our S2G can learn to generate executable operation trees where the nodes can be formulas with an arbitrary number of arguments. By explicitly generating formulas as nodes, we make the predicted results more interpretable. In addition, we proposed a knowledge-guided mechanism to guide the tree decoding using KGs and constructed a dataset in which problems are annotated with associated formulas. Experimental results showed that our S2G model achieves better performance than strong baselines.

A Data Preprocessing
In this section, we describe the data preprocessing steps required for our S2G model.

A.1 Converting to prefix notation
To perform top-down tree decoding, we follow Xie and Sun (2019) and convert the equations into prefix notation, where the operators are placed in front of their operands rather than in between. In this way, the order of the equation tokens matches the order of decoding. In our case, we also need to consider the formulas used in the equation. For a formula in the form "F(arg1, arg2)", we turn it into "[F, arg1, arg2]" so that it fits into the prefix notation. Table 5 shows an example of this infix-to-prefix conversion for an equation with formulas.
Problem: The outer radius and inner radius of a circular annulus are 5m and 3m respectively. Find the area of this circular annulus.
Equation: x = circle area(5) - circle area(3)
Prefix form: [-, circle area, 5, circle area, 3]
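To make the executability of the prefix form concrete, here is a small sketch of an evaluator that reads such a token list and computes the answer. It is an illustration under assumptions: the identifiers `circle_area` and `rectangle_area` (written with underscores), their definitions, and the `eval_prefix` helper are chosen for this example, and the constant 3.14 follows the annotated equation in Table 1.

```python
import math
import operator

PI = 3.14  # constant used in the annotated equations

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# Hypothetical formula table: name -> (number of arguments, implementation).
FORMULAS = {
    "circle_area": (1, lambda r: PI * r * r),
    "rectangle_area": (2, lambda length, width: length * width),
}


def eval_prefix(tokens):
    """Evaluate a prefix-ordered token list that may contain formulas."""

    def parse(pos):
        token = tokens[pos]
        if token in OPERATORS:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return OPERATORS[token](left, right), pos
        if token in FORMULAS:
            arity, fn = FORMULAS[token]
            args, pos = [], pos + 1
            for _ in range(arity):
                value, pos = parse(pos)
                args.append(value)
            return fn(*args), pos
        return float(token), pos + 1  # number or constant: leaf node

    value, end = parse(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after a complete tree")
    return value


# Approximately 50.24, matching the annotated answer of the annulus problem.
answer = eval_prefix(["-", "circle_area", "5", "circle_area", "3"])
```

The same token list expressed only with arithmetic operators would be `["-", "*", "*", "5", "5", "3.14", "*", "*", "3", "3", "3.14"]`, which is more than twice as long.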

A.2 Vocabulary
We follow the canonical sequence-to-sequence architecture to prepare the source vocabulary. For the target vocabulary, however, we have to take into consideration the way that humans solve MWPs. To solve a math problem, we use the numbers from the problem text (a dynamic vocabulary) and the mathematical operators learned before (a static vocabulary) and try to compose them into an equation. Sometimes, we also need to use external constant numbers (a static vocabulary) that are not in the problem text but appear in the equation (e.g., 1, 2, or 3.14). These three types of vocabulary make up the vocabulary for the equations in arithmetic problems (equation 13).
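The split between dynamic and static vocabularies can be illustrated with a short sketch that replaces the numbers in a problem text with positional placeholders and then assembles the target vocabulary for that problem. The placeholder spelling `N0, N1, ...`, the regular expression, and both helper functions are assumptions made for this example, not the authors' preprocessing code.

```python
import re

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")  # integers and simple decimals


def mask_numbers(text):
    """Replace each number in `text` with N0, N1, ... and return both parts."""
    numbers = []

    def replace(match):
        numbers.append(float(match.group()))
        return f"N{len(numbers) - 1}"

    return NUMBER_PATTERN.sub(replace, text), numbers


def build_target_vocab(numbers, operators, constants):
    """Static operators and constants plus the per-problem number placeholders."""
    dynamic = [f"N{i}" for i in range(len(numbers))]
    return operators + constants + dynamic


masked, numbers = mask_numbers(
    "The outer radius and inner radius of a circular annulus are 5m and 3m respectively."
)
vocab = build_target_vocab(numbers, ["+", "-", "*", "/"], ["1", "2", "3.14"])
```

Only the placeholder part of the vocabulary changes from problem to problem, which is why it is described as dynamic.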
We follow Xie and Sun (2019) and use a copy mechanism (Gu et al., 2016) to copy the numbers from the problem text. Hence, we can dynamically get the problem numbers during decoding. Besides, we include formulas in the target vocabulary. Table 6 shows the overall vocabulary that we use for our decoder.
Table 6:
Vocab Type: Instances
Operator: +, -, *, /
Number: N_0, N_1, N_2, ...
Constant: 1, 2, 3.14
*Formula: circle area, square area, rectangle perimeter, and so on.

B GeometryQA
Table 7 shows the 11 formulas used for annotation.

C Related Work
In this section, we briefly introduce the progress of MWP solvers, and then we focus on topics that are closer to our work, including seq2tree solvers and knowledge graphs for problem solving.
Recently, neural architectures have emerged as a dominant paradigm in math word problem solving. Wang et al. (2017) first attempt to use a seq2seq solver that utilizes an encoder-decoder architecture to encode the problem text and then decode it into equations in a way similar to machine translation. Copy mechanisms (Huang et al., 2018) and attention mechanisms are introduced to improve the performance of seq2seq models. These seq2seq models, however, can produce invalid equations, such as a binary operator with three operands, because there is no grammatical constraint on their sequential decoding. To solve this problem, seq2tree models are proposed to bring in grammatical constraints (Xie and Sun, 2019; Liu et al., 2019). We give a more detailed introduction to seq2tree models in Section C.2.

C.2 Sequence-to-Tree Models
To convert text into structured representations, several research strands have utilized sequence-to-tree models. Dong and Lapata (2016) first use seq2tree for semantic parsing to translate text into structured logical forms. Similar frameworks are also adopted for code generation (Yin and Neubig, 2017; Rabinovich et al., 2017), where they translate code snippets into executable representations or abstract syntax trees (ASTs).
Inspired by these ideas, MWP solving also adopts seq2tree to map the problem text into expression trees. This introduces the constraint that the non-leaf nodes of the tree should be operators and the leaf nodes should be numbers, and thus the resulting equations are always guaranteed to be valid. Most seq2tree solvers choose a bidirectional LSTM or GRU as their text encoder and use two separate models to predict the left and right nodes during decoding (Xie and Sun, 2019; Li et al., 2020). Our model differs from previous work in that we use a single RNN-based decoder to predict a variable number of child nodes during decoding. In addition, our model can predict formulas as nodes, which increases the interpretability of the model outputs, while previous solvers can only predict basic arithmetic operators.

C.3 Knowledge Graph for Math Word Problem Solving
To incorporate external knowledge into problem solving, some solvers utilize graph convolutional networks (Kipf and Welling, 2017) or graph attention networks (Veličković et al., 2018) to encode knowledge graphs (KGs) as an additional input. One line of work incorporates commonsense knowledge from external knowledge bases by constructing a dynamic KG for each problem to model the relationships between the entities in the problem. For example, "daisy" and "rose" would be linked to their category "flower" so that the solver can use this hypernymy information when counting the number of flowers. Another line of work builds graphs that model quantity-related information using dependency parsing and POS tagging tools (Manning et al., 2014). These graphs provide better quantity representations to the solver. Our model differs from previous models in that we aim to incorporate domain knowledge from mathematical KGs rather than from commonsense knowledge bases.