Improving Math Word Problems with Pre-trained Knowledge and Hierarchical Reasoning

Recent algorithms for math word problems (MWP) neglect outside knowledge that is not present in the problems. Most of them capture only word-level relationships and do not build hierarchical reasoning, as humans do, to mine the contextual structure between words and sentences. In this paper, we propose a Reasoning with Pre-trained Knowledge and Hierarchical Structure (RPKHS) network, which contains a pre-trained knowledge encoder and a hierarchical reasoning encoder. First, the pre-trained knowledge encoder reasons about the MWP using outside knowledge from pre-trained transformer-based models. Second, the hierarchical reasoning encoder seamlessly integrates word-level and sentence-level reasoning to bridge the entity and context domains on MWP. Extensive experiments show that RPKHS significantly outperforms state-of-the-art approaches on two large-scale, commonly used datasets, boosting performance from 77.4% to 83.9% on Math23K, from 75.5% to 82.2% on Math23K with 5-fold cross-validation, and from 83.7% to 89.8% on MAWPS. Further ablations demonstrate the effectiveness and interpretability of the proposed method.


Introduction
A Math Word Problem (MWP) is a reasoning task that requires answering a mathematical query based on a problem description. It is an interdisciplinary research topic that bridges mathematics and natural language processing. As shown in Table 1, a short narrative describes a problem and poses a question about an unknown quantity. In recent years, research on MWP using deep learning methods has gained increasing attention. Early research mainly focuses on Seq2Seq-based models (Sutskever et al., 2014; 2017b; Wang et al., 2017; Huang et al., 2018). These Seq2Seq-based methods aim to train an end-to-end model from scratch using the training dataset. Other research focuses on structure-based approaches (Xie and Sun, 2019a; Wang et al., 2018a, 2019b; Liu et al., 2019a; Zhang et al., 2020b; Li et al., 2020b; Hong et al., 2021; Li et al., 2020a), which incorporate parse trees into neural models and produce promising results in generating solution expressions for the MWP.
To answer such a question, human beings need not only to parse the question and understand the context but also to use external knowledge. However, previous methods learn the textual description purely from the short and limited narrative, without using any background knowledge that is not present in the description, which restricts the ability of the models to infer the MWP from a global perspective. Moreover, current methods mainly focus on designing diverse entity-level structures for word-level reasoning rather than bridging the hierarchical reasoning between the entity (word-level) and the context (sentence-level). Single-level reasoning alone is not enough for solving the MWP. In this paper, we propose reasoning with pre-trained knowledge and hierarchical structure (RPKHS) to jointly address these two limitations.
Our RPKHS, as shown in Figure 2, consists of two encoders, namely a pre-trained knowledge encoder and a hierarchical reasoning encoder, and a tree-structured decoder. It effectively incorporates implicit linguistic knowledge into the model via the pre-trained knowledge encoder and generates a structural representation with the hierarchical reasoning encoder. The outputs of the two encoders are fed into a tree-structured decoder (Xie and Sun, 2019b) for the final prediction.

Figure 1: (a) Word-level reasoning builds the relationship of each word across the whole textual description, which can also be considered entity-level reasoning; (b) Sentence-level reasoning mines the intra-relationship of each sentence in the paragraph; (c) Hierarchical reasoning jointly mines the intra-relationship and inter-relationship between words and sentences from the same paragraph.
To the best of our knowledge, we are the first to study the application of pre-trained knowledge to the MWP task. Implicit knowledge is knowledge embedded in a non-symbolic form, such as the weights of a neural network derived from annotated data or from large-scale unsupervised language training. Recently, Transformer-based (Vaswani et al., 2017) and specifically BERT-based (Devlin et al., 2019b; Liu et al., 2019c) models have been proposed, which incorporate large-scale linguistic pre-training and implicitly capture language-based knowledge. This type of knowledge can be quite useful for parsing the textual description.
For example, consider two sentences: 'He has 25000 dollars in his bank account.' and 'Paul appeared before the faculty to account for his various misdemeanors'. The word 'account' has totally different meanings in the two sentences because the scenes they describe differ. We therefore expect the implicit pre-trained knowledge to contain such diverse semantics for each word. This knowledge can also be regarded as a huge implicit vocabulary that endows each word with a rich representation and helps the model recover the correct meaning of words from complex text. In this paper, we take advantage of the implicit knowledge in pre-trained RoBERTa (Liu et al., 2019c) and analyze the effect of various kinds of pre-trained knowledge on the MWP task.
Current methods mainly learn the MWP by building word-level reasoning (as shown in Figure 1 (a)) with GNNs (Zhang et al., 2020b; Li et al., 2020b) or Seq2Seq models (Wang et al., 2017). They seldom model hierarchical structure. Since the descriptions of MWP have a hierarchical structure (words form sentences, sentences form a narrative), we likewise construct hierarchical reasoning (as shown in Figure 1 (c)) by first building representations of sentences from words and then aggregating those into a representation of the whole narrative.
Different words and sentences in a mathematical narrative are differentially informative. The importance of words and sentences is highly context-dependent, i.e., the same word or sentence may be differentially important in different contexts (e.g., in '5 dollars' and '5 pencils', the word '5' has different meanings). To make the model sensitive to this fact, it includes two levels of reasoning mechanisms, one at the word level and one at the sentence level. They lead the model to pay more or less attention to individual words and sentences when constructing the representation of the narrative. Taking the example in Table 1, intuitively, the first, second and fourth sentences carry more information for predicting the solution. Within these sentences, the words '25000 dollars' and 'every month' contribute more to inferring the math-aware results. In this paper, we propose a hierarchical reasoning encoder to achieve this functionality.

Contributions.
(1) As far as we know, we are the first to explore pre-trained knowledge on the MWP task, via our pre-trained knowledge encoder.
(2) We propose a hierarchical reasoning encoder that seamlessly integrates word-level and sentence-level reasoning to bridge the entity and context domains on MWP. It provides insight into which words and sentences contribute to the prediction, which can be of value in applications and analysis.
(3) Our RPKHS outperforms previous approaches by a significant margin.

Related Work
The MWP is the task of translating a short paragraph consisting of multiple short sentences into target mathematical equations. Previous approaches usually solve the MWP with rule-based methods (Yuhui et al., 2010; Bakman, 2007), statistical machine learning methods (Kushman et al., 2014; Mitra and Baral, 2016; Roy and Roth, 2018; Zou and Lu, 2019), semantic parsing methods (Shi et al., 2015; Roy and Roth, 2015; Huang et al., 2017) and deep learning methods (Ling et al., 2017a; Wang et al., 2018b; Liu et al., 2019b; Wang et al., 2017; Zhang et al., 2020a). Recently, deep learning based methods have received more attention because of their significant improvements. Wang et al. (2017) proposed a Seq2Seq-based model to directly map the linguistic text to a solution. Wang et al. (2018b) and Chiang and Chen (2019) implicitly modeled a tree-based structure for decoding MWP expressions, while Wang et al. (2019a), Liu et al. (2019b) and Xie and Sun (2019b) optimized the decoder via an explicit tree structure. Some research focused on graph structures for word-level reasoning. For example, Zhang et al. (2020a) built two customized graphs to enrich the quantity representations in the problem. Li et al. (2020b) presented a graph-to-tree encoder-decoder framework for grammar parsing.
However, these methods ignore the sentence-level relationship and the correlation between words and sentences. Unlike previous methods, we propose hierarchical reasoning that contains both word-level and sentence-level reasoning. In addition, we are the first to explore the effect of implicit knowledge from pre-trained neural network weights on the task of math word problems.

Overview
In this section, we explain the architecture and design of our proposed RPKHS network (i.e., Reasoning with Pre-trained Knowledge and Hierarchical Structure), which is composed of a pre-trained knowledge encoder, a hierarchical reasoning encoder and a tree-structured decoder. It incorporates outside knowledge into the model and bridges the hierarchical reasoning between the entity (word-level) and the context (sentence-level). An overview of RPKHS is illustrated in Figure 2. Our contributions mainly focus on the design of a joint-learning framework and two new encoders for the MWP task, which are discussed in detail in the following sections.

Problem Formulation
A math word problem (MWP) can be formulated as $(P, E)$, where $P$ is the problem text and $E$ is a solution expression. Assume that an MWP description has $L$ sentences $s_i$ and that the $i$-th sentence contains $T_i$ words; $w_{it}$ with $t \in [1, T_i]$ denotes the $t$-th word of the $i$-th sentence. Our proposed encoders project the raw problem description into a vector representation, on which we build a tree-structured decoder to predict the mathematical expression.

Pre-trained Knowledge Encoder
We want our model to incorporate implicit external knowledge as well as math-aware knowledge that can be learned from the training set. Language models, and especially transformer-based language models, have been shown to contain commonsense and factual knowledge (Petroni et al., 2019; Jiang et al., 2019). We adopt this direction and build an encoder initialized from RoBERTa (Liu et al., 2019c), which has been pre-trained on huge language corpora (e.g., BooksCorpus (Zhu et al., 2015), Wikipedia (Remy, 2002)) to capture implicit knowledge. We tokenize a description $Q$ using WordPiece (Wu et al., 2016) as in BERT (Devlin et al., 2019a), giving a sequence of $|Q|$ tokens, embed them with the pre-trained RoBERTa embeddings, and append RoBERTa's positional encoding, giving a sequence of $d$-dimensional token representations $x^Q_1, ..., x^Q_{|Q|}$. We feed these into the transformer-based pre-trained knowledge encoder, fine-tuning the representation during training. We mean-pool the outputs of all transformer steps to get our combined implicit knowledge representation $Y_p$.

Figure 2: The hierarchical reasoning encoder receives the textual embedding and constructs the inter-relationship between sentences and words to aggregate semantics across entity and context. The pre-trained knowledge encoder captures a large amount of knowledge about the linguistic world from the pre-trained network weights and incorporates this implicit knowledge into the input embedding to enrich the input representation. The results from the two encoders are then concatenated as the input of a tree-structured decoder, which parses the target mathematical equation and solution.
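The mean-pooling step can be illustrated with a minimal sketch in plain Python. The function name `mean_pool` and the toy two-dimensional vectors are illustrative assumptions; in the model, the inputs would be the transformer outputs for the $|Q|$ tokens.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    pooled = [0.0] * dim
    for vec in token_vectors:
        for j, value in enumerate(vec):
            pooled[j] += value
    return [total / len(token_vectors) for total in pooled]


# Toy encoder outputs: 3 tokens with d = 2 (hypothetical values).
encoder_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_p = mean_pool(encoder_outputs)  # plays the role of Y_p
```

Mean pooling gives every token the same weight, which contrasts with the attention-weighted aggregation used in the hierarchical reasoning encoder.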

Hierarchical Reasoning Encoder
The proposed hierarchical reasoning encoder takes into account that the different parts of a math description are not equally relevant. Moreover, determining the relevant parts involves modeling the interactions among the words, not just their isolated presence in the text. To capture this, the model includes two levels of reasoning mechanisms, one at the word level and the other at the sentence level, which let the model pay more or less attention to individual words and sentences when constructing the representation of the whole description. The hierarchical reasoning encoder is composed of two layers: the first is the word-level reasoning layer and the second is the sentence-level reasoning layer. In the following, we first introduce the GRU-based operation used in both layers and then present the details of the two reasoning layers.
GRU-based Sequence Encoding. The GRU (Bahdanau et al., 2015) uses a gating mechanism to track the state of sequences without using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. They jointly control how information is updated to the state. At time $t$, the GRU computes the new state as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t. \tag{1}$$

This is a linear interpolation between the previous state $h_{t-1}$ and the current new state $\hat{h}_t$ computed with new sequence information. The gate $z_t$ decides how much past information is kept and how much new information is added. $z_t$ is updated as

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{2}$$

where $x_t$ is the sequence vector at time $t$. The candidate state $\hat{h}_t$ is computed by

$$\hat{h}_t = \tanh(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h), \tag{3}$$

where $r_t$ is the reset gate, which controls how much the previous state contributes to the candidate state. If $r_t$ is zero, the GRU forgets the past state. The reset gate is updated by

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r). \tag{4}$$

$W$ and $U$ denote learnable weight matrices and $b$ denotes a learnable bias vector.
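As a rough illustration of a single GRU update, the sketch below implements the standard GRU recurrence in plain Python, with the update gate interpolating between the previous state and the candidate state as described above. The parameter dictionary `p`, the helper names, and the all-zero toy weights are assumptions made for this example only; they are not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p):
    """One GRU update; p maps names to weight matrices and bias vectors."""
    wx_z, uh_z = matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev)
    wx_r, uh_r = matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev)
    wx_h, uh_h = matvec(p["W_h"], x_t), matvec(p["U_h"], h_prev)
    # Update gate z_t and reset gate r_t.
    z = [sigmoid(a + b + c) for a, b, c in zip(wx_z, uh_z, p["b_z"])]
    r = [sigmoid(a + b + c) for a, b, c in zip(wx_r, uh_r, p["b_r"])]
    # Candidate state: the reset gate scales the contribution of h_prev.
    h_cand = [math.tanh(a + ri * b + c)
              for a, ri, b, c in zip(wx_h, r, uh_h, p["b_h"])]
    # New state: interpolation between the previous and candidate states.
    return [(1.0 - zi) * hp + zi * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]


# Toy parameters: input size 1, hidden size 2, all weights zero.
params = {name: [[0.0], [0.0]] for name in ("W_z", "W_r", "W_h")}
params.update({name: [[0.0, 0.0], [0.0, 0.0]] for name in ("U_z", "U_r", "U_h")})
params.update({name: [0.0, 0.0] for name in ("b_z", "b_r", "b_h")})
h_next = gru_step([0.3], [1.0, -2.0], params)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new state is half of the previous state, which makes the interpolation easy to check by hand.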
Word-level Reasoning. In this layer, the model uses a bidirectional GRU (Bahdanau et al., 2015) to produce representations of words by summarizing information from both directions, so that contextual information is incorporated into the word-level representation. Given a sentence $s_i$ with words $w_{it}$, $t \in [1, T]$ (writing $T$ for the number of words in the sentence), and an embedding matrix $W_e$, a bidirectional GRU contains a forward GRU $\overrightarrow{f}$, which reads the sentence $s_i$ from $w_{i1}$ to $w_{iT}$, and a backward GRU $\overleftarrow{f}$, which reads from $w_{iT}$ to $w_{i1}$:

$$x_{it} = W_e w_{it}, \quad t \in [1, T], \tag{5}$$
$$\overrightarrow{h}_{it} = \overrightarrow{f}(x_{it}), \quad t \in [1, T], \tag{6}$$
$$\overleftarrow{h}_{it} = \overleftarrow{f}(x_{it}), \quad t \in [T, 1]. \tag{7}$$

The word-level representation for a given word $w_{it}$ is obtained by concatenating the forward hidden state and the backward hidden state, i.e., $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$, which summarizes the information of the whole sentence centered around $w_{it}$.

Not all words contribute equally to the representation of the sentence meaning. Hence, we introduce an attention mechanism to extract the words that are important to the meaning of the sentence and aggregate the representations of those informative words to form a sentence vector. Specifically,

$$u_{it} = \tanh(W_w h_{it} + b_w), \tag{8}$$
$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}, \tag{9}$$
$$s_i = \sum_{t} \alpha_{it} h_{it}. \tag{10}$$

We first feed the word-level feature $h_{it}$ through a one-layer MLP to get $u_{it}$ as a hidden representation of $h_{it}$. Then we measure the importance of the word as the similarity of $u_{it}$ with a word-level context vector $u_w$ and get a normalized importance weight $\alpha_{it}$ through a softmax function. After that, we compute the sentence vector $s_i$ as a weighted sum of the word representations based on the learned weights. The word context vector $u_w$ in Eq. 9 can be seen as a high-level representation of a fixed query like "what is the informative word" over the words. It is inspired by memory networks (Kumar et al., 2016). The word context vector $u_w$ is randomly initialized and jointly learned during the training process.
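The attention step described above (MLP projection, similarity with a context vector, softmax, weighted sum) can be sketched in plain Python as follows. The function name `attention_pool`, the identity projection, and the toy hidden states are illustrative assumptions; in the model, the hidden states come from the bidirectional GRU and all parameters are learned.

```python
import math


def attention_pool(hidden_states, W, b, context):
    """Return (attention weights, weighted sum) for a list of hidden states."""
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b): one-layer MLP projection of the hidden state.
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W, b)]
        # Similarity between u and the context vector.
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    # Softmax over positions, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[j] for a, h in zip(weights, hidden_states))
              for j in range(dim)]
    return weights, pooled


# Toy word states for one sentence (hypothetical values).
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, sentence_vec = attention_pool(states, identity, [0.0, 0.0], [1.0, 0.0])
uniform, mean_vec = attention_pool(states, identity, [0.0, 0.0], [0.0, 0.0])
```

With a zero context vector every word receives the same weight and the result reduces to a plain average; a non-zero context vector shifts the weight toward the words whose projections align with it.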
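Sentence-level Reasoning. Given the sentence vectors $s_i$, we obtain a problem description vector in an analogous way. We use a bidirectional GRU to encode the sentences:

$$\overrightarrow{h}_{i} = \overrightarrow{f}(s_i), \quad i \in [1, L], \tag{11}$$
$$\overleftarrow{h}_{i} = \overleftarrow{f}(s_i), \quad i \in [L, 1], \tag{12}$$

where $\overrightarrow{f}$ and $\overleftarrow{f}$ denote the forward GRU and the backward GRU, respectively. We concatenate the two hidden states to obtain $h_i = [\overrightarrow{h}_{i}, \overleftarrow{h}_{i}]$. The vector $h_i$ summarizes the neighboring sentences around sentence $i$ while still focusing on sentence $i$. To reward sentences that are relevant to correctly parsing the problem description, we again use an attention mechanism and introduce a sentence-level context vector $u_s$ to measure the importance of the sentences, which can be formulated as

$$u_{i} = \tanh(W_s h_{i} + b_s), \tag{13}$$
$$\alpha_{i} = \frac{\exp(u_{i}^{\top} u_s)}{\sum_{i} \exp(u_{i}^{\top} u_s)}, \tag{14}$$
$$v = \sum_{i} \alpha_{i} h_{i}, \tag{15}$$

where $v$ is the global text vector that summarizes all the information of the sentences in a description. Similarly, the sentence-level context vector $u_s$ is randomly initialized and jointly learned during the training process.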
Merging Mechanism. After obtaining the results $Y_p$ and $Y_h$ from the pre-trained knowledge encoder and the hierarchical reasoning encoder, respectively, we place a parser at the end of the two encoders, as shown in Figure 2, to adaptively merge $Y_p$ and $Y_h$ into an enhanced representation $Y$ for final decoding. The parser can be formulated as

$$Y = F([w_p \cdot Y_p, w_h \cdot Y_h]), \tag{16}$$

where $w_p$ and $w_h$ are derived from $Y_p$ and $Y_h$ and weight the importance of each representation for the task, and $[\cdot]$ denotes the concatenation operation. We use a simple dot product to merge the two representations ($Y_p$ and $Y_h$). Then we use a linear mapping function $F$, such as a fully connected (FC) layer, to produce the enhanced representation $Y$ for final decoding. The weights $w_p$ and $w_h$ are calculated from $Y_p$ and $Y_h$ using the trainable weight matrices $W_p$ and $W_h$ and two different MLPs, $\phi_p$ and $\phi_h$.
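One possible reading of the merging mechanism is sketched below in plain Python. The exact way $w_p$ and $w_h$ are computed is not reproduced here: the sketch assumes that each encoder output has already been reduced to a scalar importance score (the hypothetical arguments `score_p` and `score_h`, which in the model would come from the MLPs $\phi_p$ and $\phi_h$) and that the two scores are normalized with a softmax. Both are assumptions made for illustration.

```python
import math


def merge(y_p, y_h, score_p, score_h, F):
    """Weight two encoder outputs, concatenate them, and map them with F."""
    # Assumption: the two importance scores are normalized with a softmax.
    top = max(score_p, score_h)
    e_p, e_h = math.exp(score_p - top), math.exp(score_h - top)
    w_p, w_h = e_p / (e_p + e_h), e_h / (e_p + e_h)
    # Concatenate the weighted representations.
    concat = [w_p * v for v in y_p] + [w_h * v for v in y_h]
    # F is a linear map without bias here: one row per output unit.
    return [sum(w * x for w, x in zip(row, concat)) for row in F]


# Toy inputs (hypothetical values): equal scores give w_p = w_h = 0.5.
y_p = [2.0, 4.0]
y_h = [6.0, 8.0]
F = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
y = merge(y_p, y_h, 0.0, 0.0, F)
```

With equal scores the two encoders contribute equally; raising one score shifts the merged representation toward that encoder's output.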

Decoder and Optimization
Tree-structured Decoder. Following the goal-driven tree structure (GTS) (Xie and Sun, 2019b), we apply a tree-structured decoder, as shown in Figure 2, that uses the outputs of our encoders to generate tree-structured targets such as mathematical equations. A math equation consists of operators and quantities. Each quantity is defined as a leaf node, and each operator node is required to have two child nodes. The tree-structured decoder generates an equation expression in pre-order traversal order. First, the root operator is generated, followed by its left child node. The generation process is applied recursively until the last leaf node of the left subtree is completed. Next, the right child nodes are generated in the same way. To realize this tree generation, our model initializes the root node vector from the global context representation $Y$ produced by the two encoders. The expression trees in our decoder contain three types of nodes: math operators $V_{op}$; constant quantities $V_{con}$, which are common-sense numerical values that occur in the target expression but not in the problem text (e.g., a rabbit has 4 legs); and the numbers $n_P$ that occur in problem $P$. For each token $y$ in the target vocabulary $V_{tar}$, its token embedding $e(y|P)$ is defined as

$$e(y \mid P) = \begin{cases} M_{op}(y) & \text{if } y \in V_{op}, \\ M_{con}(y) & \text{if } y \in V_{con}, \\ h^{p}_{loc(y,P)} & \text{if } y \in n_P, \end{cases} \tag{19}$$

where $M_{op}$ and $M_{con}$ are two trainable word embedding matrices that are independent of the specific problem. For a numeric value in $n_P$, however, we take the corresponding hidden state $h^{p}_{loc}$ from the encoder as its token embedding, where $loc(y, P)$ is the index position of the numeric value $y$ in $P$. The constant quantities $V_{con}$ and the numbers $n_P$ always occupy leaf positions, while the math operators $V_{op}$ occupy the non-leaf positions. The representation of $n_P$ therefore depends on the specific MWP description, because $y$ takes the corresponding hidden state $h^{p}_{loc}$ from the encoder outputs, whereas the representations of $V_{op}$ and $V_{con}$ are obtained independently from the two embedding matrices $M_{op}$ and $M_{con}$.
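To make the pre-order convention concrete, the sketch below rebuilds an expression tree from a pre-order token sequence and evaluates it. It only illustrates the output format (operators as internal nodes with two children, quantities as leaves); it is not the neural decoder. The token sequence is our own worked expression for one of the example problems in this paper (24 pictures plus 12 pictures, minus 14 deleted), and the names `build_tree` and `evaluate` are hypothetical.

```python
OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def build_tree(tokens):
    """Consume pre-order tokens and return a nested (op, left, right) tree."""
    stream = iter(tokens)

    def parse():
        token = next(stream)
        if token in OPERATORS:
            left = parse()    # the left subtree is completed first
            right = parse()   # then the right subtree
            return (token, left, right)
        return float(token)   # quantities are leaves

    tree = parse()
    if next(stream, None) is not None:
        raise ValueError("trailing tokens after a complete tree")
    return tree


def evaluate(tree):
    """Evaluate a tree produced by build_tree."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPERATORS[op](evaluate(left), evaluate(right))
    return tree


# Pre-order form of (24 + 12) - 14.
prefix = ["-", "+", "24", "12", "14"]
tree = build_tree(prefix)
answer = evaluate(tree)
```

Because every operator takes exactly two children, a pre-order sequence determines the tree uniquely, which is why the decoder can stop as soon as the last open right child is filled with a leaf.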
For the tree-structured decoder, we mainly follow GTS (Xie and Sun, 2019b) to parse the root vector into math equations. As in the GTS decoder, we prepare the candidates for operators and numbers in our target vocabulary. We then use the root vector together with trainable vectors to iteratively predict the probability of each node token $y$ from the target vocabulary. The token $y$ (operator, number, etc.) with the highest probability is selected to fill the tree node according to the rules in Equation (19).

Optimization. Since the MWP task can be formulated as $(P, E)$, we define its loss function as $L(E, P)$, the sum of the negative log-likelihoods of the probabilities for predicting the token $y_t$ of the $t$-th node. Formally, the objective function is

$$L(E, P) = \sum_{t=1}^{m} -\log p(y_t \mid q_t, Y_t, P), \tag{20}$$

where $m$ denotes the size of $E$, and $q_t$ and $Y_t$ are the target vector and its context vector at the $t$-th node. The probability $p$ is calculated by the distribution computation function in GTS (Xie and Sun, 2019b).
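The objective, a sum of negative log-likelihoods over the nodes of the target expression, can be written as a short plain-Python sketch. The per-node distributions below are made-up toy values; in the model they would come from the decoder's distribution computation function.

```python
import math


def expression_loss(step_distributions, target_tokens):
    """Sum of -log p(y_t) over the nodes of the target expression."""
    loss = 0.0
    for dist, target in zip(step_distributions, target_tokens):
        loss += -math.log(dist[target])
    return loss


# Toy predicted distributions for a two-node target (hypothetical values).
steps = [
    {"+": 0.25, "-": 0.5, "24": 0.25},
    {"+": 0.5, "-": 0.25, "24": 0.25},
]
loss = expression_loss(steps, ["-", "+"])
```

Each node contributes zero loss only when the decoder assigns probability 1 to the correct token, so the sum penalizes every node at which the model is uncertain.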

Experiments
In this section, we first introduce the data that we use and the state-of-the-art baselines that we compare against. Then we give the implementation details of our experiments. Next, we present our results in comparison with other methods and provide extensive analyses. Finally, we conduct ablation studies and show visualizations to investigate the effectiveness of the proposed components of our model (Reasoning with Pre-trained Knowledge and Hierarchical Structure, RPKHS).

Datasets and Evaluation
Datasets. We evaluate our proposed RPKHS and compare it with other state-of-the-art methods on two commonly-used datasets, namely MAWPS (Koncel-Kedziorski et al., 2016) with 2,373 problems and Math23K (Wang et al., 2018b) containing 23,162 problems.
Evaluation. Following other work (Xie and Sun, 2019b), we measure the performance of our proposed method on both datasets by solution accuracy. For the Math23K dataset, previous methods use two evaluation settings. One evaluates the model on the test set (denoted as "Math23K" in Table 2). The other uses 5-fold cross-validation (denoted as "Math23K*"). We compare our model with other methods in both settings.

Implementation Details
We implement our proposed RPKHS with PyTorch (Paszke et al., 2019) and Python 3.6. All experiments are conducted on an Ubuntu 18.04 server with 4 Tesla V100 GPUs. Nvidia CUDA 10.1 and cuDNN 7.5 are used for acceleration. Unless noted otherwise, settings are the same for all experiments. We set the dimension of the word embedding to 128 and the dimension of all hidden states in the other layers of our hierarchical reasoning encoder to 512. For our pre-trained knowledge encoder, we strictly follow the setting in (Liu et al., 2019c) and use their pre-trained weights as the initial weights. We use the aforementioned objective function $L(E, P)$ for all experiments. We set the batch size to 64 for 4 GPUs with a dropout (Hinton et al., 2012) rate of 0.5, and set the weight decay to 1e-5 to prevent overfitting. We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.0001 for the pre-trained knowledge encoder and 0.001 for the other parts of our model. The $\beta_1$ and $\beta_2$ of the optimizer are set to 0.94 and 0.99, respectively.
We adopt a plateau learning rate scheduler that reduces the learning rate by half every 20 epochs. Our model is trained for 80 epochs. The beam size is set to 5 in beam search to generate expression trees, following GTS (Xie and Sun, 2019b).
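The learning-rate settings can be summarized with a small plain-Python sketch. It assumes that the schedule is read as a fixed halving every 20 epochs, starting from the two initial rates given above; the function `learning_rate` and the group names are hypothetical and do not correspond to any library API.

```python
def learning_rate(base_lr, epoch, step=20, factor=0.5):
    """Learning rate at a given epoch when it is halved every `step` epochs."""
    return base_lr * factor ** (epoch // step)


# Initial learning rates for the two parameter groups, as stated in the text.
BASE_LR = {"pretrained_encoder": 1e-4, "other_modules": 1e-3}

# Learning rate of each group for each of the 80 training epochs.
schedule = {
    name: [learning_rate(lr, epoch) for epoch in range(80)]
    for name, lr in BASE_LR.items()
}
```

Under this reading, the rate of the non-pre-trained modules falls from 0.001 to 0.000125 by the last 20 epochs, and the pre-trained encoder always trains ten times more slowly, which limits how far fine-tuning moves the pre-trained weights.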

Ablation Studies
The effect of our proposed components. As shown in Table 3, we use a word embedding layer, an LSTM layer and a tree-structured decoder as our baseline model, which achieves 74.9% accuracy on the Math23K test set. Adding our word-level reasoning boosts accuracy by 0.9% over the baseline. Sentence-level reasoning alone improves the baseline by 1.2%. Combining both reasoning processes achieves 79.8%, which validates the usefulness and superior ability of the hierarchical reasoning encoder. With the pre-trained knowledge encoder, our model improves significantly from 74.9% to 80.1%, which strongly supports the feasibility of using implicit knowledge from pre-trained neural networks on math word problems. Furthermore, combining word-level reasoning with pre-trained knowledge reaches 81.4%. Combining sentence-level reasoning with the pre-trained knowledge encoder increases accuracy by 2.2% compared with using the pre-trained knowledge encoder alone.

Figure 3: Example problems used for the attention visualization.
While on vacation, Debby took 24 pictures at the zoo and 12 at the museum. If she later deleted 14 of the pictures, how many pictures from her vacation did she still have?
There are 64 pigs in the barn. Some more come to join them. Now there are 86 pigs. How many pigs came to join them?
On Saturday, Sara spent $10.62 each on 2 tickets to a movie theater. She also rented a movie for $1.59, and bought a movie for $13.95. How much money in total did Sara spend on movies?
John and Jim needed to meet to discuss changes in a construction project. They were 880 miles apart. If they met after 8 hours and both traveled at the same speed, how fast did each go in miles per hour?
The effect of different pre-trained knowledge. As shown in Table 5, we explore the effect of language-based knowledge from different pre-trained transformer-based variants on the MWP task, namely BERT-base (Devlin et al., 2019b), BERT-large, RoBERTa-base (Liu et al., 2019c) and RoBERTa-large. We observe that more powerful pre-trained language models achieve better performance (78.9%→83.9%). One reason for these gains is the commonsense and factual knowledge in the transformer-based models, which have been pre-trained on large-scale corpora to capture implicit knowledge. These experimental results also support the effectiveness of using outside knowledge to assist in the MWP task.

Case Study
In Table 4, we perform a case study on the solution expressions generated by GTS, Graph2Tree and our RPKHS. Previous methods wrongly predict the operator (e.g., GTS in the 1st example, Graph2Tree in the 2nd example) and the calculation order (e.g., Graph2Tree in the 1st example and GTS in the 2nd example). For the last example, GTS and Graph2Tree predict wrong quantities (e.g., '2.12-2.6' for GTS, '2.6-1.8' for Graph2Tree), while our RPKHS handles this situation better. We believe this is because our model encodes the MWP in a richer representation by reasoning with pre-trained knowledge and hierarchical structure.

Visualizations
To validate that our model is able to select informative words and sentences in a problem description, we visualize the hierarchical attention weights in Figure 3 for four examples. Every line is a sentence (segment). Green denotes the sentence weight and blue denotes the word weight. Because of the hierarchical structure, we normalize the word weight by the sentence weight so that only important words in important sentences are emphasized.
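A minimal sketch of this normalization is given below, assuming that "normalizing by the sentence weight" means multiplying each word weight by the weight of its sentence. This reading, the function name, and the toy weights are assumptions made for illustration only.

```python
def normalized_word_weights(sentence_weights, word_weights):
    """Scale each word weight by the weight of the sentence containing it."""
    return [[s * w for w in words]
            for s, words in zip(sentence_weights, word_weights)]


# Toy attention weights: two sentences with two words each (hypothetical values).
sentence_weights = [0.75, 0.25]
word_weights = [[0.5, 0.5], [0.75, 0.25]]
display = normalized_word_weights(sentence_weights, word_weights)
```

In this toy case, the most attended word of the second sentence (0.75 within its sentence) ends up with a lower displayed weight than either word of the first sentence, because its sentence carries little weight overall.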
Looking through the four examples, we observe that our model selects the quantity words (positions) that contribute strongly to the equation, such as 24, 12 and 14 in the 1st case and 64 and 86 in the 3rd case. In addition, our model can usually localize the relationship between the quantities and their semantics accurately, such as '2 tickets' in the 2nd case and '880 miles' in the 4th case. Moreover, our model can deal with complex cross-sentence contexts by building correlations between different sentences. For instance, the 1st sentence ('John and Jim...') in the 4th case seems irrelevant to solving the problem because it contains no quantity words. However, through sentence-level reasoning, our model identifies the 1st sentence as carrying important information when parsing the 4th sentence (e.g., 'how fast did each...'). These visualizations of the hierarchical reasoning process let us interpret our results with concrete evidence and show the effectiveness of our design.

Conclusion
We propose reasoning with pre-trained knowledge and hierarchical structure to jointly incorporate implicit knowledge and hierarchical representation into the neural network, which is achieved by two encoders. The pre-trained knowledge encoder uses implicit knowledge to enhance the textual representation. The hierarchical reasoning encoder bridges the entity and context domains on MWP by building hierarchical reasoning across the word level and the sentence level. Extensive experiments show that the proposed model achieves new state-of-the-art performance.