Math Word Problem Solving with Explicit Numerical Values

In recent years, math word problem solving has received considerable attention and achieved promising results, but previous methods rarely take numerical values into consideration. Most methods treat the numerical values in the problems as number symbols and ignore the prominent role the numerical values play in solving the problem. In this paper, we propose a novel approach called NumS2T, which enhances math word problem solving performance by explicitly incorporating numerical values into a sequence-to-tree network. In addition, a numerical properties prediction mechanism is used to capture the category and comparison information of numerals and measure their importance in global expressions. Experimental results on the Math23K and Ape210K datasets demonstrate that our model achieves better performance than existing state-of-the-art models.


Introduction
Taking a math word problem as input, the math word problem solving task aims to generate a corresponding solvable expression and answer. With the advancements in natural language processing, math word problem solving has received growing attention in recent years (Roy and Roth, 2015; Mitra and Baral, 2016; Ling et al., 2017; Huang et al., 2018). Many methods have been proposed that use sequence-to-sequence (seq2seq) models with an attention mechanism (Bahdanau et al., 2014) for math word problem solving (Wang et al., 2017b, 2018b). To better utilize expression structure information, some methods use sequence-to-tree (seq2tree) models to generate expressions and have achieved promising results (Liu et al., 2019; Xie and Sun, 2019; Wu et al., 2020). These methods convert the target expression into a binary tree and generate the pre-order traversal sequence of this expression tree based on the parent and sibling nodes of each node.
Although promising results have been achieved, previous methods rarely take numerical values into consideration, despite the fact that in math word problem solving, numerical values provide vital information. As an infinite number of numerals can appear in math word problems, it is impossible to list them all in the vocabulary. Previous methods replace all the numbers in the problems with number symbols (e.g., $v_1$, $v_2$) in order in the preprocessing stage. These replaced problems are used as input to directly generate expressions containing number symbols. The number symbols in the expressions are then replaced with the numerical values in the original problems to obtain executable expressions. As shown in Figure 1, taking the problem with numerical values $\{v_2 = 15, v_3 = 10, v_4 = 100, v_5 = 25\}$ as input, the target expression of the problem would be "$v_4/(v_2 - v_3) + v_5$". However, if the number symbol $v_5 = 20\%$, the target expression for the same problem would be "$v_4/(v_2 - v_3) * (1 + v_5)$". Similarly, without numerical value information, the model can hardly determine whether the number gap between the table and the chair should be $v_2 - v_3$ or $v_3 - v_2$. As such, it will incorrectly generate the same expression for problems with different numerical values.
To address these problems, we propose a novel approach called NumS2T to better capture numerical value information and utilize numerical properties. Specifically, the proposed model uses a sequence-to-tree network with a digit-to-digit number encoder that explicitly incorporates numerical values into the model and captures number-aware problem representations. In addition, we designed a numerical properties prediction mechanism to further utilize the numerical properties. NumS2T predicts the comparative relationship between paired numerical values, determines the category of each numeral, and measures their importance for generating the final expression. With the category and comparison information, the model can better identify the interactive relationship between the numerals, and thus generate better results. With consideration of the importance of the numerals, the model can capture the global relationship between the numerals and target expressions rather than simply focusing on the local relationship between numeral pairs.
The main contributions of this paper can be summarized as follows:
• We explicitly incorporate numerical value information into math word problem solving tasks.
• We propose a numerical properties prediction mechanism to utilize numerical properties. To incorporate the local relationship between numerals and the global relationship associated with the final expression, NumS2T compares the paired numerical values, determines the category of each numeral, and then measures whether they should appear in the final expression.
• We conducted experiments on two large-scale datasets, Math23K and Ape210K, to verify the effectiveness of our NumS2T model. The results show that our model achieved better performance than existing state-of-the-art methods.

Models
In this section, we present details regarding our proposed NumS2T model. As shown in Figure 2, we use an attention-based sequence-to-tree model with a problem encoder (Section 2.2) and a tree-structured decoder to generate math expressions (Section 2.4). In addition, we explicitly incorporate numerical values to obtain number-aware problem representations (Section 2.3). Finally, we propose a numerical properties prediction mechanism to further utilize the numerical properties (Section 2.5).

Problem Definition
A math word problem $X = (x_1, x_2, \ldots, x_m)$ is a sequence of $m$ words. Our goal is to generate a math expression $Y = (y_1, y_2, \ldots, y_n)$, where $Y$ is the pre-order traversal sequence of a binary math expression tree, which can be executed to produce the answer to problem $X$. For example, the expression $v_4/(v_2 - v_3) + v_5$ corresponds to the pre-order sequence $(+, /, v_4, -, v_2, v_3, v_5)$.
Here, we replace all of the numbers in the problem $X$ with a list of number symbols based on their order of appearance. Let $V_c = (v_1, v_2, \ldots, v_K)$ be the $K$ numbers that appear in problem $X$. The numerical value of the $k$-th number $v_k$ is a sequence of $l$ characters $(v_k^1, v_k^2, \ldots, v_k^l)$. The generated vocabulary $V_g$ is composed of several common numbers (e.g., 1, 100, $\pi$) and several math operators (e.g., $+, -, *, /$). At each time step during decoding, the NumS2T model either copies a number from $V_c$ or generates a word from $V_g$.
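As a concrete illustration of this preprocessing step, here is a minimal sketch in Python; the regex and the symbol format are our assumptions, not the paper's specification:

```python
import re

# Matches fractions like (1/3), decimals like 6.5, and integers, each with an
# optional trailing percent sign. The exact pattern is an assumption.
NUM_PATTERN = re.compile(r"\(\d+/\d+\)|\d+\.\d+%?|\d+%?")

def replace_numbers(problem: str):
    """Replace each number in the problem with a symbol v1, v2, ...
    Returns the symbolized text and the list V_c of original numeric strings."""
    values = []
    def _sub(match):
        values.append(match.group(0))
        return f"v{len(values)}"
    symbolized = NUM_PATTERN.sub(_sub, problem)
    return symbolized, values

text, v_c = replace_numbers("A hall has 80 seats and sold 52 tickets at (1/3) discount.")
# text -> "A hall has v1 seats and sold v2 tickets at v3 discount."
# v_c  -> ["80", "52", "(1/3)"]
```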

Problem Encoder
We use a two-layer bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) network as the encoder, which encodes the math word problem $X$ into a sequence of hidden states:

$$h_i^x = \big[\overrightarrow{\mathrm{LSTM}}\big(E(x_i)\big) : \overleftarrow{\mathrm{LSTM}}\big(E(x_i)\big)\big] \in \mathbb{R}^{2d}. \quad (1)$$

[Figure 2 (caption, partially recovered): the numeral hidden states are concatenated with the problem hidden states in (a) to obtain number-aware problem representations $h_i$; (c) depicts the numerical properties prediction mechanism, which compares the paired numerical values, determines the category of each numeral, and measures whether they should appear in the target expression.]
Here, word embedding vectors $E(x_i)$ are obtained via a word embedding layer $E(\cdot)$, $d$ is the dimension of the hidden state, and $h_i^x$ is the concatenation of the forward and backward LSTM hidden states.
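A minimal PyTorch sketch of this encoder is given below; the module structure and default sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ProblemEncoder(nn.Module):
    """Two-layer BiLSTM over word embeddings; h_i^x concatenates both directions."""
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # E(.)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, num_layers=2,
                              bidirectional=True, batch_first=True)

    def forward(self, x_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(x_ids)        # (batch, m, emb_dim)
        h_x, _ = self.bilstm(emb)      # (batch, m, hidden): fwd/bwd states concatenated
        return h_x
```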
Following Wu et al. (2020), we enrich the problem representations with common-sense knowledge information from external knowledge bases. The words in problem sequences X and their categories in external knowledge bases are constructed as an entity graph. In this entity graph, each word is related to its neighbor in the problem. If there are two nouns belonging to the same category in the knowledge base, these two nouns are related to their categories. See Wu et al. (2020) for more details.
The knowledge-aware problem states $h_i^{kg}$ are obtained from a two-layer graph attention network (Veličković et al., 2018) on the entity graph:

$$e_{ij} = f\big(\mathbf{w}_h^{T}\,[\mathbf{W}_x h_i^x \,\|\, \mathbf{W}_x h_j^x]\big), \qquad \alpha_{ij} = \frac{A_{ij}\exp(e_{ij})}{\sum_{k} A_{ik}\exp(e_{ik})}, \qquad h_i^{kg} = \big[:\big]_{t=1}^{T}\, \sigma\Big(\sum_{j} \alpha_{ij}^{(t)} \mathbf{W}_k h_j^x\Big),$$

where $\mathbf{w}_h^T$, $\mathbf{W}_x$, $\mathbf{W}_k$ are a weight vector and weight matrices, $\|$ and $[:]$ are concatenation functions, $f(\cdot)$ and $\sigma$ are the LeakyReLU and sigmoid activation functions, and $T$ is the number of heads in the GAT layer. If the $i$-th word is related to the $j$-th word, the adjacency matrix entry $A_{ij}$ is set to 1; otherwise it is set to 0.
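The sketch below shows one plausible way to build the adjacency matrix $A$ from this description; the `noun_category` lookup stands in for the external knowledge-base query, and linking same-category nouns directly (rather than through explicit category nodes) is a simplification:

```python
import torch

def build_adjacency(words, noun_category, window=1):
    """A_ij = 1 if words i and j are neighbors in the problem, or are nouns
    sharing a category in the external knowledge base; else 0."""
    n = len(words)
    A = torch.eye(n)  # self-loops, a common GAT convention (assumption)
    for i in range(n):
        for j in range(max(0, i - window), min(n, i + window + 1)):
            A[i, j] = 1.0                       # neighbor edges
    for i in range(n):
        for j in range(n):
            ci, cj = noun_category.get(words[i]), noun_category.get(words[j])
            if ci is not None and ci == cj:
                A[i, j] = 1.0                   # same-category noun edges
    return A
```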

Number-aware Problem Representations
To solve the issues mentioned in the introduction section, we need to incorporate explicit numerical value information into NumS2T. However, there are an infinite number of numerals that can appear in math word problems. For example, among the 18,529 problems in the training set of Math23K, there are 3,058 different numerical values. Therefore, rather than list all these numerals in the vocabulary, we encode each numeral value digit by digit.
All the digits in the numerical value $v_k$ are treated as a sequence $(v_k^1, v_k^2, \ldots, v_k^l)$ and embedded via the embedding layer $E(\cdot)$. Taking the 5-character value $v_k = (1/3)$ as an example, we have $E(v_k) \in \mathbb{R}^{5 \times d_{emb}}$. Similar to the architecture shown in Equation 1, we use a BiLSTM network to encode the numerical values and obtain the numeral hidden states $h_{v_k}^{n}$ with an average pooling layer:

$$(h_{v_k}^{1}, \ldots, h_{v_k}^{l}) = \mathrm{BiLSTM}\big(E(v_k)\big), \qquad h_{v_k}^{n} = \frac{1}{l}\sum_{j=1}^{l} h_{v_k}^{j}.$$

To capture the relations and dependencies between numeral pairs, we use a self-attention mechanism (Wang et al., 2017a) on the hidden states of all the numerals $H_v^n = \{h_{v_k}^{n}\}_{k=1}^{K}$ to compute the contextual numeral hidden states $h_{v_k}^{cn}$:

$$\alpha_{v_k} = \mathrm{softmax}\big(h_{v_k}^{n}\,(H_v^n)^{T}\big), \qquad h_{v_k}^{cn} = \alpha_{v_k} H_v^n,$$

where $\alpha_{v_k}$ is the attention distribution of $v_k$ over all the numerals in problem $X$.
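The following sketch illustrates digit-by-digit encoding with average pooling and the self-attention over numerals; the module names and the bilinear attention form are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NumberEncoder(nn.Module):
    """Encode each numeral digit by digit, then relate numerals to each other."""
    def __init__(self, n_digit_types: int, emb_dim: int = 300, hidden: int = 512):
        super().__init__()
        # digit characters plus '.', '/', '%', '(', ')'
        self.embed = nn.Embedding(n_digit_types, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(hidden, hidden)

    def forward(self, digit_ids: torch.Tensor):
        # digit_ids: (K, l) -- K numerals, each a sequence of l digit characters
        states, _ = self.bilstm(self.embed(digit_ids))   # (K, l, hidden)
        h_v = states.mean(dim=1)                         # average pooling -> h^n (K, hidden)
        # self-attention: each numeral attends to all numerals in the problem
        scores = h_v @ self.attn(h_v).t()                # (K, K)
        alpha = F.softmax(scores, dim=-1)
        h_cn = alpha @ h_v                               # contextual states h^cn
        return h_v, h_cn
```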
Combining the numeral hidden states $h_{v_k}^{n}$, $h_{v_k}^{cn}$ with the original problem hidden states $h_i^x$, $h_i^{kg}$, we obtain number-aware problem states $h_i^{num}$ enhanced with explicit numerical value information:

$$h_i^{num} = \big[h_{v_k}^{n} : h_{v_k}^{cn}\big] \quad \text{if the } i\text{-th word is the number symbol of } v_k.$$

The final number-aware problem representations are obtained by concatenating the problem hidden states $h_i^x$, the knowledge-aware problem states $h_i^{kg}$, and the number-aware problem states $h_i^{num}$:

$$h_i = \big[h_i^x : h_i^{kg} : h_i^{num}\big].$$
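Continuing the sketch, the states could be assembled as below; aligning each numeral's states back to its token position, with zeros elsewhere, is an assumed detail the paper does not spell out:

```python
import torch

def number_aware_states(h_x, h_kg, h_v, h_cn, num_pos):
    # h_x, h_kg: (m, hidden) problem and knowledge-aware states for the m tokens.
    # num_pos[k] = token index of the k-th number symbol v_k.
    m, hidden = h_x.shape
    h_num = torch.zeros(m, 2 * hidden)
    for k, i in enumerate(num_pos):
        h_num[i] = torch.cat([h_v[k], h_cn[k]])   # attach numeral states to v_k's slot
    return torch.cat([h_x, h_kg, h_num], dim=-1)  # final representations h_i
```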

Tree Structured Decoder
Previous works (Xie and Sun, 2019; Liu et al., 2019; Wu et al., 2020) have confirmed that a sequence-to-tree model can better represent expression structures than a sequence-to-sequence model, because a tree-structured decoder can capture global expression information and focus on the features of adjacent nodes. The tree-structured decoder takes the final number-aware problem representations $h_i$ as input and generates the target expression from top to bottom. The target expression can be regarded as a pre-order traversal of a binary tree, with operators as internal nodes and numbers as leaf nodes. The decoder is a one-layer LSTM, which updates its states as follows:

$$s_{t+1} = \mathrm{LSTM}\big(\big[E(y_t) : c_t : r_t\big],\; s_t\big).$$

At time step $t+1$, the decoder uses the last generated word embedding $E(y_t)$, the problem context state $c_t$, and the expression context state $r_t$ to update its previous hidden state $s_t$.
The problem context state $c_t$ is computed via an attention mechanism as follows:

$$e_{ti} = \mathbf{w}^T \tanh\big(\mathbf{W}_h h_i + \mathbf{W}_s s_t\big), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{j}\exp(e_{tj})}, \qquad c_t = \sum_{i} \alpha_{ti} h_i,$$

where $\mathbf{W}_h$, $\mathbf{W}_s$ are weight matrices and $\alpha_{ti}$ is the attention distribution over the number-aware problem representations $h_i$. The expression context state $r_t$ is computed via a state aggregation mechanism (Wu et al., 2020). It describes the global representation of the partial expression $y_{<t} = (y_1, y_2, \ldots, y_{t-1})$ generated by the decoder so far. At time step $t$, the decoder aggregates each node's context state with those of its neighbor nodes in the generated partial expression tree. The aggregation functions are as follows:

$$r_t^{\eta+1} = \sigma\big(\mathbf{W}_r \big[r_t^{\eta} : r_{t,p}^{\eta} : r_{t,l}^{\eta} : r_{t,r}^{\eta}\big]\big),$$

where $\sigma$ is the sigmoid function and $\mathbf{W}_r$ is a weight matrix. $r_t^{0}$ is initialized with the decoder hidden state $s_t$ when $\eta = 0$. $r_{t,p}$, $r_{t,l}$, $r_{t,r}$ are the context states of the parent node, the left child node, and the right child node of $y_t$ in the expression tree. $r_t^{\eta+1}$ represents the expression context state updated with global information from all nodes in the generated partial expression.
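A sketch of this attention step, using the standard additive (Bahdanau-style) form consistent with the weight names above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProblemAttention(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.W_h = nn.Linear(hidden, hidden, bias=False)
        self.W_s = nn.Linear(hidden, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)

    def forward(self, h, s_t):
        # h: (m, hidden) number-aware problem representations; s_t: (hidden,)
        e = self.w(torch.tanh(self.W_h(h) + self.W_s(s_t))).squeeze(-1)  # (m,)
        alpha = F.softmax(e, dim=-1)     # attention distribution alpha_ti
        c_t = alpha @ h                  # problem context state c_t
        return c_t, alpha
```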
Lastly, the decoder can generate a word from the given vocabulary $V_g$. It can also generate a number symbol in $V_c$, and use it to copy a number from the problem $X$. The final distribution is the combination of the generation probability and the copy probability:

$$P(y_t \mid y_{<t}, X) = (1 - p_c)\, P_{V_g}(y_t) + p_c\, P_{V_c}(y_t).$$

Here, $H_v$ are the number-aware problem representations of all the numerals $v_k$ in $X$, $\mathbf{W}_z$, $\mathbf{W}_v$ are the weight matrices used to score the two distributions, $f(\cdot)$ is a perceptron layer, and $p_c$ is the probability that the current word is a number copied from the problem.
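Schematically, the mixing step might look as follows; this is in the spirit of pointer-generator networks, and the exact scoring functions in NumS2T may differ:

```python
import torch
import torch.nn.functional as F

def final_distribution(s_t, W_z, H_v, W_v, p_c):
    """Mix generation over V_g with copying a numeral from the problem.
    p_c in (0, 1) is the copy probability (its predictor is omitted here)."""
    p_gen = F.softmax(W_z @ s_t, dim=-1)        # over the vocabulary V_g
    copy_scores = H_v @ (W_v @ s_t)             # one score per numeral v_k
    p_copy = F.softmax(copy_scores, dim=-1)     # over the numerals V_c
    return torch.cat([(1 - p_c) * p_gen, p_c * p_copy])  # concatenated distribution
```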

Numerical Properties Prediction Mechanism
Our NumS2T model explicitly incorporates numerical value information. Furthermore, it utilizes the numerical properties to the degree possible through a numerical properties prediction mechanism. We consider three numerical properties to be useful for solving math word problems:

Pairwise numeral comparison. If we consider the question "What is the difference between $v_1$ and $v_2$," the comparative relationship between these two numerals can help the model decide whether to generate $v_1 - v_2$ or $v_2 - v_1$. In this paper, we compare each numeral $v_k$ in the question with the other numerals. Then, we calculate the pairwise comparison scores $z_{kj}$ based on their number-aware problem representations, and we optimize the pairwise comparison loss to assign numerals with larger numerical values higher pairwise comparison scores. The pairwise comparison loss $\mathcal{L}_{CR}$ is calculated as follows:

$$\mathcal{L}_{CR} = -\sum_{k=1}^{K} \sum_{j \neq k} \Big( \mathbb{1}[v_k \geq v_j] \log \sigma(z_{kj}) + \mathbb{1}[v_k < v_j] \log\big(1 - \sigma(z_{kj})\big) \Big).$$

Numeral categories. In the sentence "the number of apples is 5 more than the number of pears," replacing the numeral 5 with the integer 100 may not affect the structure of the target expression, but replacing the numeral 5 with 20% may change the structure from "+5" to "*(1 + 20%)". We roughly divide all numbers into four categories, {integer, decimal, fraction, percentage}, and assign the category labels C = {1, 2, 3, 4}, respectively. Given the number-aware problem representation $h_{v_k}$ for each numeral $v_k$, we calculate the category score distribution $P(C_{v_k} \mid h_{v_k})$ and then minimize the negative log-likelihood:

$$\mathcal{L}_{C} = -\sum_{k=1}^{K} \log P(C_{v_k} \mid h_{v_k}).$$

Global relationship with target expressions. Current models tend to focus on the local relationship between numerals, while sometimes these numerals are not related to the target expression. Given "3 bags of rice weighing 60 kg," the numeral 3 is highly correlated with 60. However, if the problem relates to the total price of the rice rather than the weight of each bag of rice, the numeral 3 is not so important for generating the target expression. The NumS2T model predicts a scalar value $g_{v_k}$ for each numeral that denotes whether this numeral will be used in the math expression. The importance label $a_{v_k} = 1$ when $v_k$ is used in the ground-truth math expression; otherwise $a_{v_k} = 0$. The supervised loss is defined by:

$$\mathcal{L}_{G} = -\sum_{k=1}^{K} \Big( a_{v_k} \log g_{v_k} + (1 - a_{v_k}) \log\big(1 - g_{v_k}\big) \Big).$$
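A sketch of the three auxiliary losses as described above; writing the pairwise term as a binary cross-entropy over the scores $z_{kj}$ is one plausible realization, and the 0-indexed category labels are an implementation convenience:

```python
import torch
import torch.nn.functional as F

def numerical_properties_losses(z, values, cat_logits, cat_labels, g, a):
    """z: (K, K) pairwise comparison scores z_kj; values: (K,) numeric values;
    cat_logits: (K, 4) category scores; cat_labels: (K,) long tensor in {0..3};
    g: (K,) importance logits; a: (K,) 0/1 usage labels."""
    K = len(values)
    # Pairwise comparison: larger numerals should receive higher scores.
    target = (values.unsqueeze(1) >= values.unsqueeze(0)).float()    # (K, K)
    mask = ~torch.eye(K, dtype=torch.bool)                           # exclude j == k
    l_cr = F.binary_cross_entropy_with_logits(z[mask], target[mask])
    # Numeral category: negative log-likelihood over the four classes.
    l_c = F.cross_entropy(cat_logits, cat_labels)
    # Global relationship: does the numeral appear in the gold expression?
    l_g = F.binary_cross_entropy_with_logits(g, a.float())
    return l_cr, l_c, l_g
```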

Training
During training, for each question-expression pair $(X, Y)$, we first train NumS2T by optimizing the maximum likelihood estimation (MLE) loss $\mathcal{L}_{l}$ on the probability distribution $P(y_t \mid y_{<t}, X)$:

$$\mathcal{L}_{l} = -\sum_{t=1}^{n} \log P(y_t \mid y_{<t}, X).$$

Then, the final loss function $\mathcal{L}$ is a combination of the MLE loss and the three numerical properties loss functions:

$$\mathcal{L} = \mathcal{L}_{l} + \beta_1 \mathcal{L}_{CR} + \beta_2 \mathcal{L}_{C} + \beta_3 \mathcal{L}_{G}.$$

Here, $\beta_1$, $\beta_2$, $\beta_3$ are hyper-parameters.
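The combination itself is straightforward; a sketch using the $\beta$ values from the implementation details below:

```python
def total_loss(l_mle, l_cr, l_c, l_g, beta1=0.5, beta2=0.5, beta3=0.5):
    """Final training objective: MLE loss plus weighted numerical-properties losses."""
    return l_mle + beta1 * l_cr + beta2 * l_c + beta3 * l_g
```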

Dataset
We present the experimental results of math word problem solving using our proposed models on the Math23K (Wang et al., 2017b) and Ape210K (Zhao et al., 2020) datasets. We report answer accuracy as the main evaluation metric of the math word problem solving task.

Implementation Details
In this paper, we truncate the problem to a maximum sequence length of 150, and the expression to a maximum sequence length of 50. We select the 4,000 words that appear most frequently in the training set of each dataset as the vocabulary, and replace the remaining words with a special token UNK. We initialize the word embeddings with pretrained 300-dimensional word vectors. The problem encoder uses two external knowledge bases: Cilin (Mei, 1985) and HowNet (Dong et al., 2010). The number of heads $T$ in the GAT is 8. The hidden size is 512 and the batch size is 64. We use the Adam optimizer (Kingma and Ba, 2014) to optimize the models, and the learning rate is 0.001. We compute the final loss function with $\beta_1$, $\beta_2$, $\beta_3$ set to 0.5. Dropout (Srivastava et al., 2014) is set to 0.5. Models are trained for 80 epochs on the Math23K dataset and 50 epochs on the Ape210K dataset. During testing, the beam size is set to 5. Once all internal nodes in the expression tree have two child nodes, the decoder stops generating the next word. The hyper-parameters are tuned on the validation set.
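For reference, the hyper-parameters above collected in one place (a sketch, not the authors' released configuration):

```python
CONFIG = {
    "max_problem_len": 150,
    "max_expression_len": 50,
    "vocab_size": 4000,          # most frequent words; the rest map to UNK
    "word_emb_dim": 300,         # pretrained word vectors
    "gat_heads": 8,
    "hidden_size": 512,
    "batch_size": 64,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "loss_betas": (0.5, 0.5, 0.5),
    "dropout": 0.5,
    "epochs": {"Math23K": 80, "Ape210K": 50},
    "beam_size": 5,
}
```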

Baselines
We compare our proposed NumS2T model with the following baseline models. DNS (Wang et al., 2017b) is a seq2seq model with a two-layer GRU as the encoder and a two-layer LSTM as the decoder. DNS-Retrieval is a variant of DNS combined with a retrieval model. S2S (Wang et al., 2018a) is a standard bidirectional LSTM-based seq2seq model with an attention mechanism. RecursiveNN (Wang et al., 2019) uses a recursive neural network on predicted tree-structure templates. Tree-Decoder (Liu et al., 2019) is a seq2tree model with a tree-structured decoder; the decoder generates each node based on its parent node and its sibling node. GTS (Xie and Sun, 2019) generates each node based on its parent node and its left sibling subtree embedding, where the subtree embedding is obtained by merging the embeddings of the subtree from bottom to top. KA-S2T (Wu et al., 2020) is a seq2tree model with external knowledge and a state aggregation mechanism; its decoder uses a two-layer GCN to recursively aggregate the neighbors of each node in the partial expression tree.

Results Analysis
The main evaluation results are presented in Table 1. Compared with the baseline methods, our model obtains the highest answer accuracy: 78.1% on the Math23K dataset and 70.5% on the Ape210K dataset, which is significantly better than the other state-of-the-art methods. The experimental results provide the following observations: 1) The methods with a tree-structured decoder (Tree-Decoder, GTS, KA-S2T) perform better than methods with a sequence-structured decoder (DNS, S2S). These methods treat the math expression as a binary tree and directly use adjacent nodes in the tree, instead of the previous word in the sequence, to generate the next word. In this way, the model can better capture the structure information of the math expressions.
2) The KA-S2T model with external knowledge performs better than GTS, which shows that external knowledge enables the model to obtain better interaction between words.
3) NumS2T outperforms all the other baselines. This result shows the effectiveness of explicitly incorporating numerical values and of using a numerical properties prediction mechanism.

Ablation Study
Effect of explicitly incorporating numerical values: We designed several NumS2T variants that reduce the numerical values incorporated in the model. Here, "NumS2T w/o Numerals" means that we remove the character-level numerical value encoder; an input example is "Alan bought v1 apples for $ v2". "NumS2T w/o Symbols" means that we not only remove the character-level numerical value encoder but also replace the number symbols in the math problems with character-level numerical values; an input example is "Alan bought 2 5 apples for $ 1 5 0". Table 2 shows the results of these different variants, from which we can see: 1) The model performance of "NumS2T w/o Symbols" is significantly reduced on both datasets. We believe this is because directly replacing the number symbols makes it difficult for the model to obtain an overall representation of each number.
2) The use of a self-attention mechanism significantly improves accuracy, by 0.8% on Math23K and 0.7% on Ape210K. This is because the same numerical value may describe different information in different problems. The self-attention mechanism combines each numerical value with the other numerical values in the problem, which helps to model numerical information and the relations between these numerals.
3) Without numerical values, the answer accuracy of "NumS2T w/o Numerals" is reduced to 76.6% and 69.2%. These results show the benefit of explicitly incorporating numerical values.

Effect of the numerical properties prediction mechanism: Table 3 shows the results of several NumS2T variants designed to measure the effect of the numerical properties prediction mechanism. From the table, we can observe that: 1) NumS2T-base is the variant of NumS2T without the numerical properties prediction mechanism. Without numerical properties, the answer accuracies on the Math23K and Ape210K datasets are reduced to 77.0% and 69.6%, which shows that the numerical properties prediction mechanism contributes considerably to improving performance. In addition, NumS2T-base still outperforms the state-of-the-art baseline KA-S2T, which once again demonstrates the effectiveness of explicitly incorporating numerical values.
2) The use of pairwise numeral comparison, numeral categories, and the global relationship with the target expression improves accuracy by approximately 0.6%, 0.4%, and 0.3%, respectively. Their combination achieves further improvements in model performance. These results show the effectiveness of the numerical properties prediction mechanism, as it enables the model to further utilize numerical properties.

Model performance on problems with different numbers of numerals: Table 4 shows how accuracy changes as the number of numerals in the problem increases. The NumS2T model outperforms the best-performing baseline on problems with different numbers of numerals. In addition, as the number of numerals in the problems increases, the performance gap between NumS2T and KA-S2T also increases. This is because, with more numerals in the problem, NumS2T, which explicitly incorporates numerical value information, more readily achieves better performance. Meanwhile, NumS2T also achieves a considerable improvement on problems with only one numeral. This further demonstrates the effect of utilizing numerical category information and global relationship information.

In the third problem of the case study, 80 seats and 52 tickets are strongly semantically related, so KA-S2T generates the sub-expression "80-52". However, this problem is about the fares that have already been sold rather than how many tickets are left. With numerical properties, NumS2T is able to realize that 80 is not related to the target expression and should not appear in the generated result.

Related Work
Math Word Problem Solving: In recent years, Seq2Seq models have been widely used in math word problem solving tasks (Ling et al., 2017; Wang et al., 2017b, 2018a). To better utilize expression structure information, recent studies have used Seq2Tree models (Liu et al., 2019; Zhang et al., 2020a). Xie and Sun (2019) proposed a tree-structured decoder that uses a goal-driven approach to generate expression trees. Wu et al. (2020) proposed a knowledge-aware Seq2Tree model with a state aggregation mechanism that incorporates common-sense knowledge from external knowledge bases. Recently, several methods have attempted to use the contextual information of the numbers in the problem: one line of work proposes a group attention mechanism to extract quantity-related features and quantity-pair features, and Zhang et al. (2020b) connect each number in the problem with nearby nouns to enrich the problem representations.
However, these methods rarely take numerical values into consideration. They replace all the numbers in the problems with number symbols and ignore the vital information provided by the numerical values in math word problem solving. As such, these methods will incorrectly generate the same expression for problems with different numerical values.

Numerical Value Representations: Some recent studies have explored numerical value representations in language models (Naik et al., 2019; Chen et al., 2019; Wallace et al., 2019). Spithourakis and Riedel (2018) investigated several strategies used in language models for modeling numerals. Gong et al. (2020) proposed the use of contextual numerical value representations to enhance neural content planning by helping models understand data values. To incorporate numerical value information into math word problem solving tasks, we use a digit-to-digit numerical value encoder to obtain number-aware problem representations. To further utilize the numerical properties, we propose a numerical properties prediction mechanism.

Conclusion
In this study, we proposed a novel approach called NumS2T, which better captures numerical value information and utilizes numerical properties. In this model, we use a digit-to-digit numerical value encoder to explicitly incorporate numerical values. In addition, we designed a numerical properties prediction mechanism that compares the paired numerical values, determines the category of each numeral, and measures whether they should appear in the final expression. Experimental results show that our proposed NumS2T model outperforms other state-of-the-art baseline methods.