Neural-Symbolic Solver for Math Word Problems with Auxiliary Tasks

Previous math word problem solvers following the encoder-decoder paradigm fail to explicitly incorporate essential math symbolic constraints, leading to unexplainable and unreasonable predictions. Herein, we propose the Neural-Symbolic Solver (NS-Solver) to explicitly and seamlessly incorporate different levels of symbolic constraints via auxiliary tasks. Our NS-Solver consists of a problem reader to encode problems, a programmer to generate symbolic equations, and a symbolic executor to obtain answers. Along with target expression supervision, our solver is also optimized via four new auxiliary objectives that enforce different levels of symbolic reasoning: a) a self-supervised number prediction task predicting both number quantity and number locations; b) a commonsense constant prediction task predicting what prior knowledge (e.g., how many legs a chicken has) is required; c) a program consistency checker computing the semantic loss between the predicted equation and the target equation to ensure reasonable equation mapping; d) a duality exploiting task exploiting the quasi-duality between symbolic equation generation and the problem's part-of-speech generation to enhance the understanding ability of a solver. Besides, to provide a more realistic and challenging benchmark for developing a universal and scalable solver, we also construct a new large-scale MWP benchmark, CM17K, consisting of 4 kinds of MWPs (arithmetic, one-unknown linear, one-unknown non-linear, equation set) with more than 17K samples. Extensive experiments on Math23K and our CM17K demonstrate the superiority of our NS-Solver compared to state-of-the-art methods.


Introduction
Deep neural networks have achieved remarkable successes in natural language processing recently. Although neural models have demonstrated performance superior to humans on some tasks, e.g., reading comprehension (Rajpurkar et al., 2016; Devlin et al., 2019; Lan et al., 2020), they still lack the ability of discrete reasoning, resulting in low accuracy on math reasoning. Thus, it is hard for pure neural network approaches to tackle the task of solving math word problems (MWPs), which requires a model to be capable of both natural language understanding and discrete reasoning. MWP solving aims to automatically answer a math word problem by understanding the textual description of the problem and reasoning out the underlying answer. A typical MWP is a short story that describes a partial state of the world and poses a question about an unknown quantity or multiple unknown quantities. To solve an MWP, the relevant quantities need to be identified from the text. Furthermore, the correct operators along with their computation order among these quantities need to be determined. Therefore, integrating neural networks with symbolic reasoning is crucial for solving MWPs. Inspired by the recent amazing progress on neural semantic parsing (Liang et al., 2017a) and reading comprehension, we address this problem by neural-symbolic computing.
Recently, many researchers (Huang et al., 2018; Wang et al., 2018b, 2019; Xie and Sun, 2019; Chiang and Chen, 2019), inspired by the encoder-decoder framework (Cho et al., 2014), have applied neural networks to solve MWPs by learning the mapping between problems and their corresponding equations, achieving remarkable success. The encoder uses a neural network to represent a problem as a real-valued vector, and the decoder uses another neural network to generate an equation or expression token by token. The main difference among previous methods lies in how they decode expressions or equations. However, they only follow the encoder-decoder paradigm while lacking the ability to explicitly incorporate essential math symbolic constraints (e.g., commonsense constants, formulation regularization), leading to unexplainable and unreasonable predictions. Besides, most of them only focus on arithmetic MWPs without any unknowns, preventing them from generalizing to various types of MWPs, such as equation set problems.
To address the above issues, we propose a novel Neural-Symbolic Solver (NS-Solver), which explicitly and seamlessly incorporates different levels of symbolic constraints via auxiliary learning tasks. Our NS-Solver consists of three main components: a problem reader to encode the math word problems into vector representations, a programmer to generate symbolic grounded equations, and a symbolic executor that executes them to obtain final answers. In addition to the supervised training objective between generated symbolic grounded equations and ground-truth equations, our solver is also optimized by four novel auxiliary objectives that enforce four levels of problem understanding and symbolic reasoning. First, we apply a number prediction task to predict both the number quantity and the number locations in the problem in a self-supervised manner. Second, we deploy a commonsense constant prediction task to predict what prior commonsense knowledge (e.g., how many legs a chicken has) is required by our solver. Third, we propose a program consistency checker to compute the semantic loss between the predicted program and the ground-truth equation to ensure reasonable equation mapping. Finally, we also propose a novel duality exploiting task that exploits the quasi-duality between symbolic grounded equation generation and the problem's part-of-speech generation to enhance the understanding ability of our solver. There are some key advantages to our solution. First of all, the four auxiliary tasks produce additional training signals, which improves data efficiency in training and makes our solver more robust. Second, using the predicted constants to constrain the target symbolic table greatly reduces the search space, which means that our solver can generate correct symbolic grounded equations more easily and accurately.
Third, auxiliary tasks have been shown to help reduce the domain gap between seen and unseen MWPs (Sun et al., 2020), thus improving the reasoning ability of our solver. Besides, beyond current large-scale high-quality MWP benchmarks that only include one type of problem, we also construct a large-scale challenging Chinese MWP dataset, CM17K, which contains 4 types of MWPs (arithmetic MWPs, one-unknown linear MWPs, one-unknown non-linear MWPs, and equation set problems) with more than 17K samples, to provide a more realistic and challenging benchmark for developing a universal and scalable math solver. Extensive experiments on the public Math23K and our proposed CM17K demonstrate the superiority of our NS-Solver compared to state-of-the-art methods in predicting final results while ensuring intermediate equation rationality.
Related Work

Figure 1: An overview of our NS-Solver. When a problem preprocessed by number mapping and replacement is entered, our problem reader encodes the problem text into a context representation. Then our programmer explicitly generates a tree-structured symbolic grounded program. Finally, the symbolic grounded program is executed by the executor to produce answers. In our NS-Solver, we apply four auxiliary tasks to enhance its problem understanding and symbolic reasoning ability for generating better programs.

Neural-Symbolic Computing. Neural-symbolic computing has greatly promoted the development of semantic parsing. Jia and Liang (2016), Dong and Lapata (2016), and Zhong et al. (2017) applied neural sequence-to-sequence and sequence-to-tree models to semantic parsing with full supervision. Liang et al. (2017b, 2018b) have advanced the state-of-the-art in weakly supervised semantic parsing on knowledge graphs and tabular databases. Although most of the successes of semantic parsing are limited to structured data sources, data acquisition is not expensive for MWPs since it is easy to crawl lots of problems with annotated equations and answers. Therefore, MWP solving can benefit from supervised neural-symbolic computing.

Self-Supervised Learning. Self-supervised auxiliary tasks have been widely used in natural language understanding (Devlin et al., 2019; Lan et al., 2020). Devlin et al. (2019) applied two self-supervised auxiliary tasks, masked LM and next sentence prediction, to improve the understanding ability of BERT via pretraining. ALBERT (Lan et al., 2020) introduces a sentence-order prediction task to address the ineffectiveness of the next sentence prediction task in BERT. Hendrycks et al. (2019) show that self-supervised learning can improve model robustness and uncertainty.

Dual Learning. Dual learning, first proposed by He et al. (2016), is a reinforcement training process that jointly trains a primal task and its dual task. Xia et al. (2017) then cast it as a form of supervised learning and designed a probabilistic regularization term to exploit the duality. It has been widely applied in various fields, such as machine translation (He et al., 2016), sentiment classification (Xia et al., 2017), question answering (Tang et al., 2017), visual question answering (Li et al., 2018), machine reading comprehension, and code generation.
To the best of our knowledge, we are the first to exploit duality in MWPs. Different from previous works, we design a quasi-dual learning method between symbolic grounded equation generation and the problem's part-of-speech generation to enhance understanding ability by easing the difficulty of generating problems from symbolic equations.

Neural-Symbolic Solver
In this section, we present the design of the proposed NS-Solver. Its backbone mainly consists of a problem reader that encodes math word problems into vector representations, a programmer that generates symbolic grounded programs in prefix order, and a symbolic executor that obtains final results. An overview of our NS-Solver is visualized in Fig. 1. We first introduce the backbone of our NS-Solver in Section 3.1, and then introduce the auxiliary tasks in Section 3.2.

Backbone
Problem Reader. Given a problem text P = {x_i}_{i=1}^{n} processed by number template replacement, which maps numeric values in a problem to number templates (e.g., 26 and 82 to n_1 and n_2 in Fig. 1), the problem reader encodes each token x_i in the problem text into an embedding e_i. In this work, we deploy a two-layer bidirectional GRU to encode each token x_i into an embedding e_i = \overrightarrow{h}_i + \overleftarrow{h}_i, where \overrightarrow{h}_i and \overleftarrow{h}_i are the hidden states of the forward and backward GRUs, respectively. Besides, our problem encoder also outputs a problem representation g_0 = \overrightarrow{h}_n + \overleftarrow{h}_0 as the initial hidden state of our programmer, where \overrightarrow{h}_n and \overleftarrow{h}_0 are the last hidden states of the forward and backward GRUs, respectively.

Programmer. The programmer takes the output of the problem reader as input and the problem representation as its initial hidden state, and then decodes a problem into a sequence of tokens organized as a prefix equation tree. In this work, we deploy a tree-structured decoder (Xie and Sun, 2019) with an attention mechanism (Bahdanau et al., 2015) as the backbone of our programmer and modify it with the UET representation to support more symbols for multiple types of MWPs. In our programmer, the symbolic table consists of four parts. For each problem, the problem-specific symbolic table contains math operators (+, −, *, /, ^, =, ;), unknown variables (x and y), a series of commonsense constants (1, 3.14, etc.) predicted by the Commonsense Constant Prediction Task in Section 3.2, and the problem-specific number templates (n_1, n_2, n_3, etc.). It should be noted that ; is a special operator with the lowest priority that integrates multiple equation trees into an ensemble equation tree, so that equation set problems can be handled as simply as arithmetic problems.

Executor. We deploy SymPy (https://www.sympy.org/), a Python library for symbolic mathematics, as our symbolic executor to obtain final results by solving the generated equations.
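To make the programmer-executor interface concrete, the following is a minimal sketch (our illustration, not the authors' released code) of evaluating a prefix-ordered symbolic program for a purely arithmetic problem, after number templates have been mapped back to their values. The full NS-Solver executor uses SymPy so that equations with unknowns can also be solved; this toy evaluator only handles binary operators and number tokens.

```python
def eval_prefix(tokens, values):
    """Evaluate a prefix token sequence, e.g. ['+', 'n1', '*', 'n2', '2'].

    `values` maps number templates (n1, n2, ...) back to the numeric
    values restored from the original problem text.
    """
    ops = {
        '+': lambda a, b: a + b,
        '-': lambda a, b: a - b,
        '*': lambda a, b: a * b,
        '/': lambda a, b: a / b,
        '^': lambda a, b: a ** b,
    }
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ops:  # operator: recursively evaluate its two children
            left = parse()
            right = parse()
            return ops[tok](left, right)
        # leaf: a number template or a constant literal
        return values[tok] if tok in values else float(tok)

    return parse()
```

Because the program is in prefix order, a single recursive descent reconstructs and evaluates the equation tree without explicit tree data structures.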

The Design of Auxiliary Tasks
The MWP solving task remains challenging since previous methods did not take full advantage of the rich semantics contained in a problem and lacked the ability to explicitly incorporate essential math symbolic constraints. In this section, we introduce four auxiliary learning tasks that provide additional training signals; in particular, the result of the commonsense constant prediction task is used to explicitly constrain the constant symbolic table, which reduces the search space for symbolic generation and eases the difficulty of generating correct constants.

Self-supervised Number Prediction (SNP) Tasks. If a solver fully understands the problem semantics, it should be able to accurately identify the quantity of numbers in a problem (i.e., to count how many numeric values are in the problem) and their corresponding locations in the problem text. For example, if the solver understands the problem in Fig. 1, it should be able to predict that there are two numbers (26 and 82) in the problem, and that their positions are 15 and 18, respectively. Thus, number quantity prediction and number location prediction are two critical self-supervised tasks that help the problem reader fully understand the problem semantics and measure the problem understanding ability of a solver. Both number prediction tasks take the mean of the problem encoder's outputs {e_i}_{i=1}^{n} as their input and apply a single-layer feed-forward neural network to compute the distributions over number quantity and number locations. The training objectives of the two tasks for each problem are formulated as:

L_{NQP} = -\sum_{i=1}^{Q} qt_i \log q_i, \quad L_{NLP} = -\sum_{i=1}^{L} lt_i \log l_i, \qquad (1)

where L_{NQP} and L_{NLP} denote the losses for the Number Quantity Prediction (NQP) task and the Number Location Prediction (NLP) task, q_i and l_i are the predicted probabilities, and Q and L are the maximum possible number quantity and the maximum possible number location for a problem at the dataset level.
qt_i and lt_i represent the ground-truth values at the i-th index of the output probability distributions of NQP and NLP, respectively.

Commonsense Constant Prediction (CCP) Task. Commonsense constants are important for solving some MWPs, while most previous methods only consider the constants 1 and 3.14, which are not enough for a solver to handle problems that need other commonsense constants. However, attaching many constants to the problem-specific symbolic table enlarges the search space, increasing the difficulty of generating rational symbolic equations. Therefore, we propose a commonsense constant prediction task to predict what prior commonsense knowledge (e.g., a chicken has 2.0 legs and a rabbit has 4.0 legs for the problem in Fig. 1) is required by the solver to solve a problem according to the problem context. In this way, we can greatly reduce the search space, thus improving the performance of our solver. Similar to the number prediction tasks, the commonsense constant prediction task takes the mean of the problem encoder's outputs {e_i}_{i=1}^{n} as its input and applies a single-layer feed-forward neural network to compute the distribution over constants. The training objective for each problem is formulated as:

L_{CCP} = -\sum_{i=1}^{C} ct_i \log c_i,

where C is the total number of constants in the symbolic table, c_i is the predicted probability, and ct_i represents the ground-truth value at the i-th index of the output probability distribution.
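The SNP labels come for free from the number-templated problem text. The following sketch (our illustration; function name and tokenization are assumptions) shows how the ground-truth quantity and location labels can be derived by scanning for number templates n_1, n_2, ...:

```python
import re

def snp_labels(tokens):
    """Derive self-supervised labels for the number prediction tasks.

    The quantity label counts number templates (n1, n2, ...) in the
    tokenized problem; the location labels record their token positions.
    """
    positions = [i for i, tok in enumerate(tokens) if re.fullmatch(r'n\d+', tok)]
    quantity = len(positions)
    return quantity, positions
```

Since the labels are computed from the input itself, no extra annotation is needed, which is what makes these tasks self-supervised.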
Since it is impossible for the commonsense constant prediction task to achieve 100% accuracy, in addition to the predicted constants, we add the three constants that are not predicted but have the highest probability into the symbolic table, making a better trade-off between the size of the search space and prediction accuracy.

Program Consistency Checker (PCC). Although a problem can be solved by multiple equivalent but different equations, the predicted equations should be as consistent as possible with the labeled equations in the supervised learning setting. Therefore, we propose a program consistency checker that checks symbolic program consistency and regularizes the model by computing a semantic loss between the predicted symbolic program and the ground-truth equation to ensure reasonable symbolic equation mapping. Let \hat{y}_i and y_i represent the predicted symbol and the ground-truth symbol, and let p_i represent the probability of \hat{y}_i. The semantic loss is obtained by computing a distance between the predicted distribution and the ground-truth distribution:

L_{PCC} = -\sum_{i} \big[ \mathbb{1}(\hat{y}_i = y_i) \log p_i + \mathbb{1}(\hat{y}_i \neq y_i) \log (1 - p_i) \big].

Duality Exploiting (DE) Task. Many previous works (He et al., 2016; Xia et al., 2017) have shown promising results with the dual learning framework. Although MWP solving and MWP generation are intuitively related to each other, i.e., the input of MWP solving is the output of MWP generation and vice versa, it is very hard for the MWP generation task to generate good enough problems from the equations alone, without any topic information. Therefore, we propose a duality exploiting task to enhance the understanding ability of our solver by exploiting the quasi-duality between symbolic grounded equation generation and the problem's part-of-speech generation.
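As a toy illustration of one plausible form of such a semantic loss (an assumption on our part, not the authors' exact formulation): tokens where the predicted symbol matches the label contribute -log p_i, and mismatched tokens contribute -log(1 - p_i), so confident wrong predictions are penalized heavily.

```python
import math

def semantic_loss(pred_syms, gold_syms, pred_probs):
    """Per-token semantic loss between a predicted program and the label.

    pred_probs[i] is the probability the model assigned to its own
    prediction pred_syms[i].
    """
    loss = 0.0
    for y_hat, y, p in zip(pred_syms, gold_syms, pred_probs):
        # matched token: reward high confidence; mismatched: penalize it
        loss -= math.log(p) if y_hat == y else math.log(1.0 - p)
    return loss
```

A fully correct, fully confident prediction drives the loss to zero, while a confident mismatch makes it blow up, which is the regularizing behavior the checker needs.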
Given a pair of a problem and its corresponding equations (P, T), let \bar{P} be the part-of-speech sequence of P. The training objective of the duality exploiting task is formulated as:

L_{dual} = \big( \log \hat{p}(\bar{P}) + \log p(T \mid P) - \log \hat{p}(T) - \log p(\bar{P} \mid T) \big)^2,

where \hat{p}(\bar{P}) and \hat{p}(T) are marginal distributions, which can be modeled by LSTM-based (Hochreiter and Schmidhuber, 1997) language models, respectively. Besides, we deploy a tree-structured encoder inspired by GTS (Xie and Sun, 2019) to encode equations in prefix order for POS generation.
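The duality regularizer can be computed from four scalar log-probabilities: the two marginals (from language models) and the two conditional generation likelihoods. A sketch (our illustration; in training these values would come from the language models, the programmer, and the POS generator):

```python
def dual_loss(log_p_pos_marginal, log_p_eq_given_problem,
              log_p_eq_marginal, log_p_pos_given_eq):
    """Squared gap between the two factorizations of the joint probability.

    Perfect duality (equal joint probabilities under both directions)
    gives zero loss; any gap is penalized quadratically.
    """
    gap = (log_p_pos_marginal + log_p_eq_given_problem
           - log_p_eq_marginal - log_p_pos_given_eq)
    return gap ** 2
```

The regularizer pushes the solver and the POS generator toward probabilistically consistent behavior without requiring the POS generator to reconstruct the full problem text.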

Training Objective
Given the training dataset D = {(P_1, T_1), (P_2, T_2), ..., (P_N, T_N)}, where T_i is the universal expression tree of problem P_i, we minimize the following loss function for our NS-Solver:

L = \sum_{(P,T) \in D} \Big( -\sum_{t=1}^{m} \log p(y_t \mid P) + \lambda_1 L_{NQP} + \lambda_2 L_{NLP} + \lambda_3 L_{CCP} + \lambda_4 L_{PCC} \Big),

where m denotes the size of T, and y_t denotes the t-th output. {\lambda_i}_{i=1}^{4} are empirical values that will be detailed in Section 4.2.
For the duality exploiting task, there is another loss for training the branch of the problem's part-of-speech generation:

L_{pos} = \sum_{(P,T) \in D} \Big( -\sum_{t=1}^{n} \log p(x_t \mid T) + \lambda_5 L_{dual} + \lambda_6 L_{PCC} \Big),

where n denotes the size of \bar{P}, and x_t denotes the t-th output. Here L_{PCC} is the semantic loss between the predicted POS and the ground-truth POS. {\lambda_i}_{i=5}^{6} are empirical values that will also be detailed in Section 4.2.
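The overall objective is a plain weighted sum of the supervised tree loss and the auxiliary losses. A sketch of the combination (function name and default weights are ours for illustration; the actual lambda values are given in Section 4.2):

```python
def total_loss(l_tree, l_nqp, l_nlp, l_ccp, l_pcc,
               lambdas=(0.0005, 0.01, 1.0, 1.0)):
    """Combine the supervised expression loss with the weighted
    auxiliary objectives; each l_* is a scalar loss value."""
    l1, l2, l3, l4 = lambdas
    return l_tree + l1 * l_nqp + l2 * l_nlp + l3 * l_ccp + l4 * l_pcc
```

Keeping the weighting in one place makes the per-dataset tuning of lambda_4 (very small on Math23K, 1.0 on CM17K) a one-argument change.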

CM17K Dataset
Most public MWP datasets are either quite small, such as ALG514, or contain some incorrect labels, such as Dolphin18K. An exception is the Math23K dataset, which contains 23,161 problems labeled with structured equations and answers. However, it only contains one-unknown linear math word problems, which is not sufficient to validate the ability of a math solver on multiple types of MWPs. Therefore, we introduce a new high-quality math word problem dataset, called CM17K, to validate the universality of a solver and provide a more realistic and challenging benchmark for developing a universal and scalable math solver. We collected CM17K from two education websites (http://www.zxxk.com/ and http://www.jyeoo.com/). These problems are oriented to grades 6-12 and comprise 4 types of MWPs with more than 17K samples: 6215 arithmetic MWPs, 5193 one-unknown linear MWPs, 3129 one-unknown non-linear MWPs, and 2498 equation set problems. It should be noted that our dataset is sufficient for validating the universality of math word problem solvers, since these problems cover most cases of MWPs. We label our data with structured equations and answers following Math23K. We split CM17K into train/valid/test sets at a ratio of 8:1:1. The data statistics of Math23K and CM17K are shown in Table 1. From the statistics, we can see that all statistics of CM17K are larger than those of Math23K, which shows that our dataset is more challenging for math word problem solvers. Besides, since CM17K contains more types of MWPs than Math23K, it is more suitable for validating the reasoning ability of a solver.
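The 8:1:1 split described above can be sketched as follows (our illustration; the authors' actual split files and random seed are not specified, so exact membership may differ):

```python
import random

def split_dataset(samples, seed=0):
    """Shuffle the dataset and split it into train/valid/test at 8:1:1."""
    data = list(samples)
    random.Random(seed).shuffle(data)  # deterministic shuffle for reproducibility
    n = len(data)
    n_train = int(n * 0.8)
    n_valid = int(n * 0.1)
    return (data[:n_train],
            data[n_train:n_train + n_valid],
            data[n_train + n_valid:])
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing solvers on the same test subset.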

Datasets, Baselines, and Metric
We conduct experiments on Math23K and our CM17K. The main state-of-the-art methods to be compared are as follows: DNS (Wang et al., 2017) is a universal solver based on the seq2seq model with significant number identification (SNI). GTS (Xie and Sun, 2019) is a goal-driven tree-structured MWP solver. StackDecoder (Chiang and Chen, 2019) is a universal semantically-aligned math word problem solver. TSN-MD (Zhang et al., 2020a) is an enhanced GTS with teacher-student distillation and a multi-decoder ensemble. Following prior works (Chiang and Chen, 2019; Xie and Sun, 2019), we use answer accuracy as the evaluation metric: if the calculated value of the predicted equation tree equals the true answer, the prediction is regarded as correct, since the predicted expression is then equivalent to the target expression.
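The answer-accuracy metric can be sketched as below (our illustration; the tolerance value is our choice, used to guard against floating-point error when comparing executed answers, and `None` marks an unexecutable prediction):

```python
def answer_accuracy(pred_answers, gold_answers, tol=1e-4):
    """Fraction of problems whose executed predicted answer matches the
    gold answer within a small numeric tolerance."""
    correct = sum(
        1 for p, g in zip(pred_answers, gold_answers)
        if p is not None and abs(p - g) < tol
    )
    return correct / len(gold_answers)
```

Comparing executed values rather than equation strings is what lets equivalent but differently written equations count as correct.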

Implementation Details
We use PyTorch to implement our model on Linux with an NVIDIA RTX 2080Ti GPU. All words with fewer than 5 occurrences are converted into a special token UNK. The sizes of the word embeddings and of all hidden states in other layers are set to 128 and 512, respectively. Our model is optimized by the Adam optimizer (Kingma and Ba, 2015) with beta_1 = 0.9, beta_2 = 0.999, and epsilon = 1e-8. The mini-batch size is set to 32. The initial learning rate is set to 1e-3 and is then halved every 40 epochs. To prevent overfitting, we set the dropout rate to 0.5 and the weight decay to 1e-5. Finally, we conduct greedy search to generate symbolic equation trees. We set lambda_1, lambda_2, lambda_3, lambda_5, and lambda_6 to 0.0005, 0.01, 1.0, 0.005, and 0.1 for both datasets, respectively. We set lambda_4 to 0.000001 for Math23K and to 1.0 for CM17K. All constants are extracted from the training set. In each epoch, all training data is shuffled randomly and then cut into mini-batches.
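The learning-rate rule above (initial 1e-3, halved every 40 epochs) amounts to a one-line step schedule; this is a standalone sketch of the rule, not the authors' training code:

```python
def learning_rate(epoch, base_lr=1e-3, half_every=40):
    """Step schedule: halve the base learning rate every `half_every` epochs."""
    return base_lr * (0.5 ** (epoch // half_every))
```

In PyTorch this corresponds to a step scheduler with step size 40 and decay factor 0.5.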

Answer Accuracy
Following prior works (Chiang and Chen, 2019; Xie and Sun, 2019), we conduct 5-fold cross-validation on Math23K. For CM17K, we evaluate performance on the test set. The results are shown in Table 2. From Table 2, we can observe that, benefiting from the four new auxiliary tasks and the neural-symbolic paradigm, our NS-Solver outperforms the baselines on both datasets in terms of answer accuracy. Specifically, for Math23K and CM17K, the accuracy gains of NS-Solver over GTS are 1.37% and 5.93%, respectively. Compared with TSN-MD, our solver performs better by about 0.6% on Math23K. This shows that our model is more feasible for solving multiple types of MWPs. It also shows that our NS-Solver is more effective than other state-of-the-art models in the real-world scenario of solving various MWPs with a unified solver.

Comparisons on different subsets
We drill down to analyze the generalization of DNS, GTS, and NS-Solver on different types of MWPs in the test subset of CM17K. Their answer accuracy on different types of MWPs is shown in Table 3. We can observe that our NS-Solver outperforms the other two models by a large margin on all subsets. Specifically, the accuracy gains of our NS-Solver over GTS on four subsets are 3.87%, 9.12%, 6.99%, and 9.44%. This shows that with the help of four auxiliary tasks, our NS-Solver obtains better generalization ability on multiple types of MWPs than baselines.

Performance on Tree Length
Intuitively, the size of the symbolic equation tree is proportional to the complexity of the mathematical relationships in the problem: the more complex the mathematical relationships, the more difficult the problem is to solve. Here, we compare our proposed NS-Solver with GTS on CM17K to show the superiority of our NS-Solver across different equation tree sizes. The answer accuracies for different sizes of equation trees on the CM17K test subset are shown in Fig. 2. We can see that answer accuracy tends to degrade as problem complexity, measured by the size of the equation tree, grows, and that our NS-Solver outperforms GTS in most cases across equation tree sizes. This shows that our NS-Solver can better model the mathematical relationships of a problem than GTS. It can also be noticed that the improvement of our NS-Solver over GTS increases as the problems become more complex.
However, although our model outperforms other methods, there is still room for improvement in semantic understanding and symbolic reasoning, since longer equations often correspond to more complex MWPs that entail more complex mathematical relationships.

Ablation on different auxiliary tasks
We study the contribution of the different auxiliary tasks of our NS-Solver. For this purpose, we consider five different combinations: 1) only the backbone [NS-Solver -CCP -SNP -PCC -DE]; 2) backbone + duality exploiting task [NS-Solver -CCP -SNP -PCC]; 3) backbone + duality exploiting task + program consistency checker [NS-Solver -CCP -SNP]; 4) backbone + duality exploiting task + program consistency checker + number prediction tasks [NS-Solver -CCP]; and 5) the full proposed model [NS-Solver]. For each of these combinations, the model was trained for 80 epochs on CM17K and validated on its test subset. The learning rate was halved every 20 epochs. The results are provided in Fig. 4.
As one can see, all four auxiliary tasks can improve performance. Specifically, the accuracy gains of DE, PCC, SNP, and CCP are 1.00%, 1.41%, 1.11%, and 1.12%, respectively. Besides, the binary accuracies of the two SNP tasks are 97% (number quantity prediction) and 96.8% (number location prediction). Moreover, the accuracy of our CCP

Case Study
We also present the results of our NS-Solver with different combinations of the four auxiliary tasks in Fig. 3. Benefiting from explicitly exploiting the probabilistic correlation between the two quasi-dual tasks to regularize the training process, our duality exploiting (DE) task helps the solver generate more reasonable equations.

To show that our auxiliary tasks can be adapted to other backbones, we replace GTS's encoder with BERT (BERT + Tree Decoder) and NS-Solver's encoder with BERT (NS-Solver + BERT), where we adopt a Chinese BERT-base pre-trained with whole word masking (Cui et al., 2020). We conduct experiments on CM17K; the results are shown in Table 4. We can observe that, with the auxiliary tasks, our NS-Solver + BERT still outperforms BERT + Tree Decoder, which shows the strong generalization of our auxiliary tasks.

Conclusion
In this work, we propose the Neural-Symbolic Solver (NS-Solver) to explicitly and seamlessly incorporate different levels of symbolic constraints via four auxiliary tasks. Our NS-Solver consists of a problem reader to encode problems, a programmer to generate a symbolic grounded program, and a symbolic executor to obtain final results. In addition to supervised learning with the target expression, our solver is also optimized via four new auxiliary objectives that enforce four levels of symbolic reasoning. Besides, we also construct a new dataset, CM17K, containing 4 types of MWPs with more than 17K samples, which provides a more realistic and challenging benchmark for developing a universal and scalable math solver. Extensive experiments on Math23K and CM17K demonstrate the superiority of our NS-Solver compared to state-of-the-art methods in answer accuracy while ensuring intermediate equation rationality.

Ethical Impact
We collected CM17K from two online education websites, which is only used for academic research, and the copyright belongs to the original websites. This work may inspire research in the field of numerical reasoning.