An Edge-Enhanced Hierarchical Graph-to-Tree Network for Math Word Problem Solving

Math word problem solving has attracted considerable research interest in recent years. Previous works have shown the effectiveness of utilizing graph neural networks to capture the relationships in the problem. However, these works did not carefully take the edge label information and the long-range word relationships across sentences into consideration. In addition, during generation, they focus on the areas most relevant to the currently generated word, while neglecting the rest of the problem. In this paper, we propose a novel Edge-Enhanced Hierarchical Graph-to-Tree model (EEH-G2T), in which math word problems are represented as edge-labeled graphs. Specifically, an edge-enhanced hierarchical graph encoder is used to incorporate edge label information. This encoder updates the graph nodes hierarchically in two steps: sentence-level aggregation and problem-level aggregation. Furthermore, a tree-structured decoder with a split attention mechanism is applied to guide the model to pay attention to different parts of the input problem. Experimental results on the MAWPS and Math23K datasets showed that our EEH-G2T can effectively improve performance compared with state-of-the-art methods.


Introduction
Math word problem solving is an important natural language processing (NLP) task that has recently attracted increasing research interest. Math word problems are narrative texts that describe a scene with several math variables and ask a question about an unknown quantity. A simple example is illustrated in Figure 1. Based on the given problem, the target is to infer the difference between the number of boxes of apples and pears.
Previous works (Huang et al., 2018; Wang et al., 2019) used sequence-to-sequence (seq2seq) methods with an attention mechanism (Bahdanau et al., 2014) to generate math expression sequences from math word problems. To capture the structural information of math expressions, many works (Liu et al., 2019; Xie and Sun, 2019; Zhang et al., 2020a) treat math expressions as binary trees and propose sequence-to-tree (seq2tree) frameworks. These methods are designed to obtain the pre-order sequence of the expression tree, generating the current node based on its parent node and sibling node at each time step. Some works that represent problems as graphs also show better performance. Graph2Tree (Zhang et al., 2020b) connects each number in the problem with its nearby nouns to enrich the quantity representations. KA-S2T (Wu et al., 2020) connects words with their categories in an external knowledge base to capture commonsense information.
Although these methods report promising results, several challenges remain. 1) Long-range word relationships across sentences should be taken into consideration. As shown in Figure 1, the word "pear" in the second sentence should be associated with the word "pear" in the last sentence. Without long-range relationships, it is difficult for the model to connect these two words that are 15 steps apart. 2) Previous methods did not carefully take edge label information into consideration. In Figure 1, the label on the edge between "kilograms" and "pear" is nmod (noun compound modifier), while the label "category" on the edge between "apples" and "pears" means they belong to the same category in the external knowledge base. Such edge labels can provide rich syntactic and semantic information. 3) When generating expressions, previous methods tend to focus on the areas of the problem that are most relevant to the currently generated word and ignore the semantic clues provided by the rest of the problem. As shown in Figure 2, to generate "360" instead of "240" at time step 3, the model needs to attend to the entire problem to obtain the important clues that the current sub-expression "/ 360 24" is the number of apple boxes and that 360 is the weight of the apples. However, previous methods focused on the problem areas most relevant to the currently generated word (i.e., the number 360 itself), without noticing the rest of the problem.
To tackle these challenges, we propose a novel Edge-Enhanced Hierarchical Graph-to-Tree framework (EEH-G2T) for math word problem solving. EEH-G2T represents each math word problem as a graph in which the nodes are connected by labeled edges. To obtain edge-aware problem representations, we propose an edge-enhanced hierarchical graph encoder that explicitly incorporates edge label information. In addition, the hierarchical encoder updates the nodes in two steps: sentence-level aggregation and problem-level aggregation. This hierarchical structure first captures the local relations between words within a sentence and then captures the long-range dependencies between words across sentences. Further, we use a split attention mechanism to guide the decoder to pay attention to different parts of the entire input problem, not just the part most relevant to the currently generated word.
The main contributions of this paper can be summarized as follows:
• We propose an edge-enhanced hierarchical graph encoder to incorporate edge label information. Additionally, the encoder updates the graph nodes in two steps, namely sentence-level aggregation and problem-level aggregation.
• We propose a split attention mechanism to guide the decoder to pay attention to different parts of the entire input problem during the generation.
• We conducted experiments on two commonly used math word problem solving datasets, MAWPS and Math23K. Experimental results show that our approach effectively improves performance compared with state-of-the-art methods.

Problem Formulation
In this work, we focus on generating math expressions for given math word problems. We denote the text of a math word problem as a sequence of words and number symbols: X = (x_1, x_2, ..., x_m) is a math word problem with m words. Our model aims to generate a math expression Y = (y_1, y_2, ..., y_T). Here, Y is a pre-order traversal sequence of a math expression tree, which can be executed to produce the answer to problem X.

Figure 3: The procedure for constructing an edge-labeled graph. For brevity, we omit some self-node edges and the labels of some neighbor and dependency edges. Given a math word problem, we first use the Stanford CoreNLP toolkit to parse it into a dependency tree and extract the relationships between nouns from external knowledge bases. Based on these, we construct the edge-labeled graph, as shown in the bottom part of the figure (see Section 2.2 for more details).
Formally, math word problem X can be represented by a graph G = (V, E), where V is the set of nodes x_i and E is the set of edges e_ij. Each node in the graph is associated with a word x_i in the problem; e_ij ∈ E denotes that there is an edge between the node pair (x_i, x_j), and L(e_ij) denotes the label of edge e_ij (e.g., self-node, category, neighbor); see Section 2.2 for more details.

Graph Construction
This section introduces how to construct an edge-labeled graph that contains both the local relations between nodes within a sentence and the long-range relations between nodes across sentences. Our model extracts these relations from the problem's dependency tree and an external knowledge base. We use the Stanford CoreNLP toolkit to parse each math word problem into a dependency tree; the toolkit analyzes the grammatical structure of a sentence and establishes relationships between "head" words and the words that modify those heads. In addition, inspired by KA-S2T (Wu et al., 2020), we collected word category information from external knowledge bases. An illustrative example is shown in Figure 3. Specifically, given a math word problem X, its dependency tree, and word category information, our model constructs a graph according to the following steps.
• Self node & Neighbor: We define each word x_i in the problem X as a node. Each word node x_i is connected to its adjacent word nodes (x_{i-1}, x_{i+1}) in the problem; these edges are labeled "neighbor". Also, to incorporate the node's own information into the problem representations, we connect each node to itself and label the edge "self node".
• Dependency (edges within sentences): The dependency tree is a structured representation that contains various grammatical relationships between word pairs. Following Zhang et al. (2020b), we prune the output dependency tree to remove unimportant components, that is, remove edges connected to conjunctions, prepositions or punctuation. Based on the dependency tree, we establish relationships between nodes within the sentence, and keep the edge labels (e.g., nmod, nummod, appos). For example, "360" and "kilograms" are connected by the edge "nummod" in Figure 3.
• Same & Category (edges across sentences): To further capture connections across sentences, if the same word appears in two sentences and is a noun, we connect the two nodes and label the edge "same". If two words belong to the same category in the external knowledge base, we also connect their nodes and label the edge "category". For example, "apples" and "pears" are connected by a "category" edge in Figure 3.
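The four edge types above can be assembled into a small adjacency structure. The sketch below is only a minimal illustration of the construction, not the authors' implementation; the helper name and input formats (token list, sentence ids, pre-pruned dependency triples, knowledge-base category pairs) are our own assumptions.

```python
def build_edge_labeled_graph(tokens, sentence_ids, dep_edges, category_pairs):
    """Minimal sketch of the edge-labeled graph construction (Section 2.2).

    tokens         : list of words in the problem
    sentence_ids   : sentence_ids[i] is the sentence index of tokens[i]
    dep_edges      : (i, j, label) triples from an already-pruned dependency parse
    category_pairs : unordered word pairs sharing a knowledge-base category
    Assumes non-noun tokens were already filtered out of the "same"-edge check.
    """
    edges = {}

    def add(i, j, label):
        edges[(i, j)] = label
        edges[(j, i)] = label

    for i in range(len(tokens)):
        add(i, i, "self-node")                 # keep the node's own information
        if i + 1 < len(tokens):
            add(i, i + 1, "neighbor")          # adjacent words in the problem

    for i, j, label in dep_edges:              # within-sentence syntactic edges
        add(i, j, label)

    for i in range(len(tokens)):               # cross-sentence edges
        for j in range(i + 1, len(tokens)):
            if sentence_ids[i] == sentence_ids[j]:
                continue
            if tokens[i] == tokens[j]:
                add(i, j, "same")
            elif frozenset((tokens[i], tokens[j])) in category_pairs:
                add(i, j, "category")

    return edges


# Toy usage: two short "sentences" with a repeated noun and a category pair.
toy_edges = build_edge_labeled_graph(
    tokens=["360", "kilograms", "apples", ",", "240", "kilograms", "pears"],
    sentence_ids=[0, 0, 0, 0, 1, 1, 1],
    dep_edges=[(0, 1, "nummod"), (4, 5, "nummod")],
    category_pairs={frozenset(("apples", "pears"))},
)
```

Dependency labels overwrite the generic "neighbor" label when both apply to the same pair, which matches the intuition that the syntactic label is the more informative of the two.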

Graph Initialization
To initialize the node representations of the graph, we use a BiLSTM (Hochreiter and Schmidhuber, 1997) to encode the words of the math word problem X = (x_1, x_2, ..., x_m). Here, H^0 = (h^0_1, h^0_2, ..., h^0_m) ∈ R^{m×d} is the initial node representation matrix of its graph G, where m is the number of nodes and d is the dimension of the node representations. The representation h^0_i of node x_i is calculated as follows:

h^0_i = BiLSTM(Embed(x_i)),

where Embed(·) is an embedding layer. For each edge e_ij, we initialize the edge representation e^0_ij based on the edge-label embedding and its endpoint node representations h^0_i, h^0_j:

e^0_ij = W_e [Embed(L(e_ij)) : h^0_i : h^0_j],

where W_e is a weight matrix and [:] is the concatenation operation.
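For illustration, the edge initialization above is a single linear map over the concatenated label embedding and endpoint states. A minimal NumPy sketch with toy dimensions follows; the shapes and the random stand-ins for the BiLSTM outputs are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_label = 4, 3                      # toy node / edge-label dimensions

# Assume H0[i] is the BiLSTM output h^0_i for node i (here: random stand-ins).
H0 = rng.normal(size=(5, d))
label_embed = {"nummod": rng.normal(size=d_label)}

# W_e maps [label-embedding : h^0_i : h^0_j] to a d-dimensional edge vector.
W_e = rng.normal(size=(d, d_label + 2 * d))

def init_edge(i, j, label):
    """e^0_ij = W_e [Embed(L(e_ij)) : h^0_i : h^0_j] (reconstructed form)."""
    return W_e @ np.concatenate([label_embed[label], H0[i], H0[j]])

e01 = init_edge(0, 1, "nummod")        # edge vector for the pair (x_0, x_1)
```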

Edge-Enhanced Hierarchical Graph Encoder
After initializing the graph, EEH-G2T uses an edge-enhanced hierarchical graph encoder to obtain the edge-aware problem representations. It hierarchically updates the nodes in two steps: sentence-level aggregation and problem-level aggregation. We divide math word problems into short sentences based on commas and periods; for example, the problem in Figure 1 has four sentences.

Sentence-level Aggregation.
To capture the local relations between words, EEH-G2T first recursively aggregates each node representation with its related nodes within the sentence. Let A denote the local relationship matrix, where A_ij ∈ {0, 1} denotes whether there is an edge between x_i and x_j. Formally, A_ij = 1 if e_ij ∈ E and x_i, x_j are in the same sentence; otherwise A_ij = 0. The initial node representations H^0 = (h^0_1, h^0_2, ..., h^0_m) are aggregated with a two-layer graph convolutional network (GCN) (Kipf and Welling, 2017):

H^1 = σ(A σ(A H^0 W^(1)_g) W^(2)_g),

where W^(1)_g, W^(2)_g are weight matrices and σ is the ReLU activation function. After sentence-level aggregation, we obtain the node representations H^1 = (h^1_1, h^1_2, ..., h^1_m).

Problem-level Aggregation.
Then, EEH-G2T uses attentive problem-level aggregation to capture long-range dependencies across sentences. Inspired by GAT (Veličković et al., 2018), we use the multi-head attention of GAT with M independent attention mechanisms:

α_ij = softmax_j( σ( w^T_a [W_a h^1_i : W_b h^1_j : W_c e^0_ij] ) ),

h_i = ||^M_{k=1} Σ_j α_ij W_j h^1_j,

where w_a, W_a, W_b, W_c, W_j are a weight vector and weight matrices, σ is the LeakyReLU activation function (Xu et al., 2015), || is the concatenation operation, and α_ij is the normalized attention weight of node x_j for node x_i via the softmax function. After problem-level aggregation, we obtain the final problem representations H = (h_1, h_2, ..., h_m).
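The two aggregation steps of the hierarchical encoder can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the random toy inputs, the row normalization of A, and a simplified attention head that omits the edge-representation term are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 4                               # toy problem: 5 nodes, dimension 4

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_level_gcn(H0, A, W1, W2):
    """Two-layer GCN over the within-sentence adjacency A (Kipf & Welling)."""
    A_hat = A / A.sum(axis=1, keepdims=True)      # simple row normalization
    return relu(A_hat @ relu(A_hat @ H0 @ W1) @ W2)

def problem_level_attention(H1, heads):
    """GAT-style multi-head attention over all node pairs; the head outputs
    are concatenated.  (Simplified: the edge term W_c e^0_ij is omitted.)"""
    outputs = []
    for w_a, W_a, W_b, W_head in heads:
        out = np.zeros((len(H1), W_head.shape[0]))
        for i in range(len(H1)):
            scores = np.array([
                w_a @ np.concatenate([W_a @ H1[i], W_b @ H1[j]])
                for j in range(len(H1))
            ])
            scores = np.where(scores > 0, scores, 0.2 * scores)  # LeakyReLU
            alpha = softmax(scores)               # weight of each j for i
            out[i] = sum(a * (W_head @ h) for a, h in zip(alpha, H1))
        outputs.append(out)
    return np.concatenate(outputs, axis=1)        # multi-head concatenation

A = np.eye(m)                             # self-node edges
A[0, 1] = A[1, 0] = 1.0                   # one within-sentence edge
H1 = sentence_level_gcn(rng.normal(size=(m, d)), A,
                        rng.normal(size=(d, d)), rng.normal(size=(d, d)))
heads = [(rng.normal(size=2 * d), rng.normal(size=(d, d)),
          rng.normal(size=(d, d)), rng.normal(size=(3, d))) for _ in range(2)]
H = problem_level_attention(H1, heads)    # final problem representations
```

Note that the sentence-level step only mixes nodes connected in A, while the problem-level step lets every node attend to every other node, which is how long-range cross-sentence dependencies enter the representations.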

Tree-structured Decoder
The structure of the decoder is similar to other state-of-the-art seq2tree models (Xie and Sun, 2019; Zhang et al., 2020b). The decoder is an attention-based Gated Recurrent Unit (GRU) (Chung et al., 2014) whose goal is to generate the pre-order traversal of expression trees. The hidden state s_t is updated as follows:

s_{t+1} = GRU([Embed(y_t) : c_t : r_t], s_t).
At time step 1, we use the last problem representation h_m to initialize the decoder hidden state s_1. Here, Embed(y_t) denotes the embedding of the last generated word y_t; c_t denotes the context state of the problem representations, and r_t denotes the context state of the currently generated expression.

Split Attention Mechanism. Figure 4 shows the input of our proposed split attention mechanism, which is the final problem representation produced by the graph encoder. EEH-G2T first uses an attention mechanism to compute the overall attention vector ᾱ over the problem representations. Then, EEH-G2T divides the input math word problem into K parts, performs an attention operation on each part, and obtains K split attention vectors (α^1, α^2, ..., α^K); the size of each split attention vector is R^{m/K}. In Figure 1, when the decoder generates y_3 = 360, EEH-G2T notices that the word most relevant to the current decoder state is "360" in the first sentence. At the same time, EEH-G2T obtains crucial semantic clues from the other parts, namely that the problem asks how many fewer boxes of pears there are than boxes of apples. Based on the K attention vectors, the problem context state c_t is calculated as follows:

α^k_{ti} = softmax_i( (W_s s_t)^T (W_h h_i) ),   c^k_t = Σ_{i ∈ part k} α^k_{ti} h_i,   c_t = [c^1_t : ... : c^K_t],

where W_s, W_h are weight matrices and α^k_{ti} denotes the attention distribution on the k-th part of the problem representations at time step t.

Expression Aggregation Mechanism.
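The split attention computation can be sketched as follows. This is an illustrative NumPy sketch: the bilinear scoring form and the concatenation of the per-part contexts are our assumptions about the exact parameterization, and the inputs are random stand-ins.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def split_attention(s_t, H, K, W_s, W_h):
    """Score every node once, then renormalize the scores separately inside
    each of the K contiguous parts of the problem.  Returns the overall
    attention vector and the concatenated per-part context vectors."""
    scores = np.array([(W_s @ s_t) @ (W_h @ h_i) for h_i in H])
    overall = softmax(scores)                  # attention over all m nodes
    contexts = []
    for part in np.array_split(np.arange(len(H)), K):
        alpha_k = softmax(scores[part])        # k-th split attention vector
        contexts.append(alpha_k @ H[part])     # context of the k-th part
    return overall, np.concatenate(contexts)

rng = np.random.default_rng(1)
m, d, K = 6, 4, 2
overall, c_t = split_attention(rng.normal(size=d), rng.normal(size=(m, d)), K,
                               rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Because each part's scores are renormalized independently, even a part containing no highly relevant word still contributes a non-degenerate context vector, which is the mechanism's point: the decoder cannot collapse all of its attention onto the single most relevant region.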
Following KA-S2T (Wu et al., 2020), we use a state aggregation mechanism to compute the expression context state r_t:

r_t = σ(W_r [r_{t,p} : r_{t,l} : r_{t,r}]),

where σ is a sigmoid function and W_r is a weight matrix. At time step 1, we use the decoder state s_1 to initialize the expression context state r_1. For each node in the currently generated expression tree, r_{t,p}, r_{t,l}, and r_{t,r} represent the expression context states of the parent node, left child node, and right child node of the current node. If the current node does not have a parent or child node at this time step, we pad it with a PAD vector.

Finally, we use a copying mechanism (Gulcehre et al., 2016) so that the model either generates a word from the vocabulary or copies a word from the input problem X. At time step t, based on the decoder state s_t, the problem context state c_t, and the expression context state r_t, EEH-G2T calculates a copy gate value g_t ∈ (0, 1) to determine whether the word y_t is generated or copied:

g_t = σ(W_s s_t + W_c c_t + W_r r_t),   P_g(y_t) = softmax(W_g [s_t : c_t : r_t]),   P_c(y_t) = Σ_{i: x_i = y_t} ᾱ_{ti},

P(y_t | y_<t, X) = g_t P_c(y_t) + (1 − g_t) P_g(y_t),

where W_s, W_c, W_r, and W_g are weight matrices and ᾱ_{ti} is the overall attention vector in the split attention mechanism. The probability distribution P(y_t | y_<t, X) of generating y_t mixes the copy distribution P_c(y_t) and the generate distribution P_g(y_t).

Training
We train the model with the cross-entropy loss, defined as:

L = − Σ_{t=1}^{T} log P(y_t | y_<t, X).
During inference, we use beam search to generate the final expression. At time step t, if y_t is an operator, the current node is an internal node, and the model continues to generate its child nodes. If y_t is a number, it represents a leaf node with no child nodes. Once the children of all the internal nodes have been generated, the generated expression sequence Y = (y_1, y_2, ..., y_T) is transformed into an expression tree, and the decoding process terminates.
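The stopping condition above falls out naturally when a pre-order sequence is parsed recursively: an operator token opens an internal node that consumes exactly two subtrees, and a number token closes a leaf. A minimal evaluator (our illustration, not the paper's decoder) makes this concrete:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_preorder(tokens):
    """Parse and evaluate a pre-order expression sequence.

    An operator token is an internal node with exactly two children; a number
    token is a leaf.  Parsing a valid sequence therefore terminates exactly
    when every internal node has received both of its children.
    """
    it = iter(tokens)

    def parse():
        tok = next(it)
        if tok in OPS:
            left = parse()          # the left subtree is consumed first
            right = parse()
            return OPS[tok](left, right)
        return float(tok)

    return parse()

# "- / 360 24 / 240 24": boxes of apples (360/24) minus boxes of pears (240/24).
answer = eval_preorder("- / 360 24 / 240 24".split())
```

The example sequence reuses the quantities from the problem in Figure 1 and evaluates to 15 − 10 = 5 boxes.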

Datasets
We evaluated our model on two commonly used math word problem datasets: MAWPS (Koncel-Kedziorski et al., 2016) with 2,373 problems and Math23K (Wang et al., 2017) with 23,162 problems. We adopt the data preprocessing of prior work and, following previous studies (Xie and Sun, 2019; Li et al., 2020), use the same data split for the train/dev/test sets. The Stanford CoreNLP toolkit is used for dependency parsing. HowNet (Dong et al., 2010) and Cilin (Mei, 1985) are used as external knowledge bases. We build the vocabulary from words that appear more than 5 times in the training set or appear as edge labels, and replace out-of-vocabulary words with a UNK token. We use answer accuracy as the evaluation metric.
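The vocabulary rule above (frequency threshold plus edge labels) can be sketched in a few lines; the function name and input format are our own, hypothetical choices.

```python
from collections import Counter

def build_vocab(train_token_lists, edge_labels, min_count=5):
    """Keep words seen more than `min_count` times in training, plus all edge
    labels; everything else maps to UNK (a sketch of the described setup)."""
    counts = Counter(w for toks in train_token_lists for w in toks)
    vocab = {"UNK": 0}
    for word, c in counts.items():
        if c > min_count or word in edge_labels:
            vocab.setdefault(word, len(vocab))
    for label in edge_labels:              # labels may never occur in the text
        vocab.setdefault(label, len(vocab))
    return vocab

# Toy usage: "apples" clears the threshold, "pears" does not.
vocab = build_vocab([["apples"] * 6, ["pears"] * 2], {"nummod", "self-node"})
```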

Implementation Details
We used PyTorch for our implementation. We used 300-dimensional GloVe word embeddings (Pennington et al., 2014). The hidden size is 512 and the batch size is 64. The number of heads M in problem-level aggregation is 8, and the number K of split attention vectors is 2. We set the learning rate of the Adam optimizer (Kingma and Ba, 2014) to 0.001 and the dropout rate to 0.5.
We trained the model for 120 epochs. During decoding, we used beam search with a beam size of 5. We used the same parameter settings for both the Math23K and MAWPS datasets; the hyper-parameters were tuned on the validation set.

Baselines
We compare the performance of our model with the following baselines. DNS (Wang et al., 2017) is a seq2seq model that consists of a two-layer GRU encoder and a two-layer LSTM decoder. Math-EN (Wang et al., 2018) is a seq2seq model with a bidirectional LSTM encoder and an attention mechanism. Recu-RNN (Wang et al., 2019) uses recursive neural networks on the predicted tree-structure templates. Tree-Dec (Liu et al., 2019) is a seq2tree model with a tree-structured decoder, which generates each node based on its parent and sibling nodes. GTS (Xie and Sun, 2019) is a seq2tree model that generates expression trees in a goal-driven manner; it generates each node based on its parent node and its left sibling subtree embedding. KA-S2T (Wu et al., 2020) is a graph-to-tree model with commonsense knowledge from an external knowledge base; it uses a state aggregation mechanism to recursively aggregate the neighbors of each node in the expression tree. Graph2Tree (Zhang et al., 2020b) is a graph-to-tree model that leverages the nouns near the numbers to enrich the quantity representations in the problem.

Table 2: Ablation analysis of the edge-enhanced hierarchical graph encoder and split attention mechanism used in EEH-G2T.

Results Analysis
From the results, we can observe that: 1) The two graph-to-tree models, KA-S2T and Graph2Tree, performed significantly better than the seq2tree model GTS, showing that graph-structured encoders are effective in enriching the problem representations. 2) Our proposed EEH-G2T outperformed all other baselines, which demonstrates the effectiveness of the edge-enhanced hierarchical graph encoder and the split attention mechanism.

Ablation Study
Effect of Hierarchical Graph Encoder. As shown in Table 2, we evaluate the effectiveness of the proposed hierarchical graph encoder. From the results, both sentence-level aggregation and problem-level aggregation improve performance. Removing the sentence-level aggregation reduces answer accuracy by 1.1%, and removing the problem-level aggregation reduces answer accuracy by 0.7%. When we remove both aggregation mechanisms and use the initial node representations as the final problem representations, the answer accuracy decreases by 2.0%. We believe the superior performance of the hierarchical graph encoder comes from its capturing both the local relations between words within a sentence and the long-range relations between words across sentences.

Effect of Edge Label Information and Split Attention Mechanism.
To prove the effectiveness of the edge label information and the split attention mechanism in the proposed EEH-G2T, we conduct ablation experiments on the Math23K dataset, as shown in Table 2. We observe a slight accuracy drop of 0.4% after removing the edge label information, demonstrating that edge labels provide syntactic and semantic information that enriches the problem representations. Moreover, removing the split attention mechanism leads to a drop of 0.8%, which verifies the effectiveness of the split attention mechanism.

Effect of Different Edge Categories. Table 3 shows the performance when removing one edge category at a time. We can see that all the edge categories have positive effects on model performance. The model without "self node" edges drops the most, because "self node" edges allow the model to keep the information of the node itself. Additionally, removing "category" and "neighbor" edges slightly reduces model performance. Without "dependency" and "same" edges, model accuracy drops to 76.9% and 76.0%, respectively.

Split Number in Split Attention Mechanism.
To explore the impact of the number K of split vectors, we conduct a parameter experiment on the Math23K validation set, varying K from 0 to 5. As shown in Table 4, when K increases from 0 to 2, noticeable improvements in answer accuracy are observed. These results once again confirm the effectiveness of the split attention mechanism, which allows the model to pay attention to different parts of the input problem. The performance starts to drop when K ≥ 3: more splits let the model attend to more parts of the problem, but too many splits break the problem into small fragments and introduce noise. We set K to 2 in all other experiments.

Figure 5: Two examples generated by Graph2Tree and EEH-G2T.
Problem 1: In a library, science books account for 20% of the collection, story books account for 1/3 of the collection, and there are 500 fewer science books than story books. How many total books are there in the library?
Graph2Tree: / 500 - 20% (1/3)    EEH-G2T: / 500 - (1/3) 20%
Problem 2: Alan produced 648 machine parts in 8 hours, Ben produced 72 machine parts in 4 hours. How many more parts does Alan produce per hour than Ben?

Figure 5 lists two examples generated by Graph2Tree and our EEH-G2T model. In Problem 1, Graph2Tree missed the information that there are fewer science books than story books and incorrectly generated "- 20% (1/3)". With the split attention mechanism, EEH-G2T better captures this information from the entire problem. In Problem 2, Graph2Tree incorrectly subtracts Alan's hourly production from Ben's. With the hierarchical graph encoder, EEH-G2T builds long-range relations across sentences and therefore generates the correct result.

Related Work
Math Word Problem Solving: Solving math word problems has long been a popular task, and various methods have been proposed in the past few years (Ling et al., 2017; Wang et al., 2017, 2018). Previous methods usually treated the math word problem as a sequence and used a linear encoder to encode it (Liu et al., 2019; Xie and Sun, 2019). Recently, many works that treat math word problems as graphs have shown better performance. Zhang et al. (2020b) connect each number in the problem with nearby nouns to enrich the problem representations. Wu et al. (2020) connect words that belong to the same category in an external knowledge base to capture commonsense information. Li et al. (2020) construct an input graph from both the math problem and its corresponding dependency tree to incorporate structural information. However, these methods only capture the local neighbor information of nodes as additional features to enrich the problem representations and ignore the long-range relations across sentences.
In this paper, we propose an edge-enhanced hierarchical graph encoder that captures both the local relations between words within a sentence and the long-range relations between words across sentences. To further guide the decoder to pay attention to different parts of the entire input problem, we propose a split attention mechanism.

Graph Neural Networks: Many works on graph neural networks (GNNs) have been applied to a variety of tasks in recent years, such as node classification (Veličković et al., 2018; Klicpera et al., 2019), relation extraction (Sahu et al., 2019), and code summarization (Zügner et al., 2021; Liu et al., 2021). Sahu et al. (2019) proposed a labeled-edge graph convolutional network model on a document-level graph for inter-sentence relation extraction. Cui et al. (2020) simultaneously exploit syntactic structure and typed dependency labels to improve neural event detection. Inspired by such works, we also leverage edge label information to enrich the problem representations.

Conclusion
In this study, we proposed a novel edge-enhanced hierarchical graph-to-tree model called EEH-G2T for the math word problem solving task. We used an edge-enhanced hierarchical graph encoder that updates the graph nodes in two steps, namely sentence-level aggregation and problem-level aggregation. Additionally, edge label information was incorporated into the model to enrich the problem representations. We proposed a split attention mechanism to guide the decoder to pay attention to different parts of the entire input problem during generation. Experimental results confirmed that the proposed model, EEH-G2T, outperformed other state-of-the-art models.