Compositional Mathematical Encoding for Math Word Problems

Solving math word problems (MWPs) remains a challenging task, as it requires understanding both the semantic meaning of the text and the mathematical logic among quantities, i.e., learning in both the semantics modal and the quantity modal. Current MWP encoders work in a uni-modal setting: they map the given problem description to a latent representation, which is then passed to the decoder. The generalizability of these MWP encoders is thus limited, because some problems are semantics-demanding while others are quantity-demanding. To address this problem, we propose a Compositional Math Word Problem Solver (C-MWP), which works in a bi-modal setting and encodes the two modalities in an interactive way. Extensive experiments validate the effectiveness of C-MWP and show its superiority over state-of-the-art models on public benchmarks.


Introduction
The task of math word problem (MWP) solving, a sub-area of neuro-symbolic reasoning, aims to map natural language problem descriptions into executable solution equations that yield the correct answer. It requires perceptual abilities, such as comprehending the question and identifying the quantities and their attributes, as well as complex semantic understanding skills, such as performing logical inference, making comparisons and leveraging external mathematical knowledge.
While MWP encoders have been sophisticatedly designed to understand natural language problem descriptions, the differences in understanding diverse types of problems have received little attention. We find that MWPs can generally be grouped into three categories based on the keywords in (Liu et al., 2019): "Story Problem", "Algebra Problem" and "Knowledge Problem". A "Story Problem" often includes a significant amount of background information such as characters, objectives and behaviors.
An "Algebra Problem" involves math notation or is composed of elementary concepts. A "Knowledge Problem" asks for external knowledge such as geometry and number sequences, as shown in Figure 1.
These types of problems can be understood compositionally with different levels of attention to the semantics modal and the quantity modal: RNNs and pre-trained language models usually focus on the textual information, while GCNs with quantity-centered graphs capture the relationships between quantities and contexts. However, the encoders in existing MWP solvers either model only the semantics modality or use quantity-modal priors merely to refine the MWP encoding (Zhang et al., 2020; Shen and Jin, 2020). Although quantity-centered refinement can bring improvements on quantity-demanding problems in particular, it weakens semantics understanding (evidence can be found in Table 3). This limitation, that one joint modal cannot do it all, decreases the generalization of MWP solvers and is exactly what compositional learning aims to address. In this work, we propose to disentangle the semantics modal and the quantity modal through compositional learning at the encoding stage, aiming to improve generalization across different types of problems.
Contributions. (i) A novel and effective bi-modal approach is proposed, for the first time, to enable compositional MWP understanding. (ii) A joint reasoning module is designed for our bi-modal architecture to flexibly incorporate the different modalities. (iii) Extensive experiments and ablative studies on two large-scale MWP benchmarks, Math23k (Wang et al., 2017) and MAWPS (Koncel-Kedziorski et al., 2016), show the superiority of the proposed approach over related works.
[Figure 1 content:
Story Problem — Text: "348 teddy bears are sold for $23 each. There are 470 teddy bears in total in a store, and the remaining teddy bears are sold for $17 each. How much did the store earn after selling all the teddy bears?" Semantics: some teddy bears have been sold at one price; the remaining part will be sold at a different price; the goal is to compute the expected income.
Algebra Problem — Text: "2 times A is the same as 3 times B. B equals 28. Compute A."]
Related Work
Compositional Learning in NLP. Modeling compositionality in language has been a long-standing issue in the NLP community (Wong and Wang, 2007). One common practice is to perform disentanglement over language representations at different levels (Welch et al., 2020); such works usually focus on atomic semantic units like characters, words and phrases. As logic-form annotations naturally possess compositional features, compositionality is also incorporated into generating correct logic content. It is therefore often injected into traditional semantic parsing tasks (Chen et al., 2020; Yang et al., 2022), where the goals seen during training can be decomposed and then reorganized into a novel goal.
Our work is the first attempt to inject a compositional prior into MWP encoding. It is worth noting that MWP solving has the same well-organized logic-form annotations as machine reasoning, which naturally calls for compositionality.
Math Word Problem Solving. Earlier MWP solvers parse problem descriptions semantically and learn templates for generating answers (Koncel-Kedziorski et al., 2015). Recent works (Wang et al., 2017; Xie and Sun, 2019; Li et al., 2019; Zhang et al., 2020; Shen and Jin, 2020; Wu et al., 2021b,a; Lin et al., 2021; Liang and Zhang, 2021; Jie et al., 2022) employ encoder-decoder frameworks (e.g., sequence-to-sequence, sequence-to-tree, graph-to-tree) built on traditional RNN structures to translate MWP texts into equations. New settings have also been introduced (Amini et al., 2019; Miao et al., 2020) that extend MWP solving to equation-group generation and to diagnosing awareness of external knowledge. More recently, many researchers have built strong MWP solvers upon pre-trained language models (PLMs) (Huang et al., 2021; Li et al., 2021; Yu et al., 2021; Shen et al., 2021; Lan et al., 2022) and achieved great performance. Differently, our work lays the groundwork for feature extraction in the quantity modal, which is orthogonal to those works.
In this work, we not only propose an explicit compositional encoding module with a multi-layer design that jointly leverages semantic and quantity information for effective MWP understanding, but also provide detailed analysis to verify its compositional learning ability.
Our Approach

Compositional Mathematical Encoder
As shown in Figure 2, our CMEncoder block consists of a semantic encoder, a quantity encoder and a dynamic fusion block. The semantic encoder aims to extract semantic information from the problem description, understanding the background and objectives. The quantity encoder encodes the problem only through quantity-related graphs, helping the encoder learn the properties of quantities and the relationships between quantities and contexts.
Semantic Encoder. To demonstrate the robustness of our approach, we implement two different semantic encoders as our backbone. Firstly, we encode the problem description W with a bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014). The outputs of the GRU are the hidden state vectors of all tokens, H_r = {h_1, h_2, ..., h_n}, where n is the length of problem W:
H_r = BiGRU(Embed_s(W)),
where Embed_s(W) is the embedding of the textual description W in the semantics modal. Empirically, we find that two stacked CMEncoders, as shown in Figure 2, achieve the best performance. Secondly, pre-trained language models (PLMs) have become ubiquitous in NLP tasks; we use the latest release of MWP-BERT (Liang et al., 2022) as an alternative semantic encoder to obtain H_r.
Quantity Encoder. To encode the quantity modal of the problem W, we feed a graph transformer G_trans with a Quantity Comparison Graph and a Quantity Cell Graph, following Graph2Tree (Zhang et al., 2020):
H_g = G_trans(Embed_q(W)),
where Embed_q(W) is the embedding matrix in the quantity modal. The two graphs improve the quantity representation by incorporating quantity-magnitude information and the quantity-context relationship. Different from Graph2Tree, the two embeddings Embed_s(W) and Embed_q(W) are updated separately during training to extract the semantics and quantity features independently. In this way, the semantics and quantity modals are disentangled, which alleviates the issue that "one joint modal cannot do it all" and enables the C-MWP solver to pay different levels of attention when solving different types of problems.
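As a small illustration of the quantity-magnitude information the Quantity Comparison Graph carries, the sketch below builds a directed comparison adjacency matrix over the quantities of a problem. The exact edge convention (an edge i → j when quantity i exceeds quantity j) follows the spirit of Graph2Tree but is an assumption of this sketch.

```python
import numpy as np

def quantity_comparison_graph(quantities):
    # Directed edge i -> j whenever quantity i is greater than quantity j,
    # exposing relative magnitudes to the graph transformer.
    n = len(quantities)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            if i != j and quantities[i] > quantities[j]:
                adj[i, j] = 1
    return adj

# Quantities from the Figure 1 story problem: 348, 23, 470, 17.
A = quantity_comparison_graph([348.0, 23.0, 470.0, 17.0])
```

Here 470 (row 2) dominates every other quantity, while 17 (row 3) dominates none; such structure lets the quantity encoder reason that "remaining teddy bears" must be 470 − 348 rather than the reverse.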
Dynamic Fusion. To achieve joint reasoning over the semantics information and the quantity information, we design a dynamic fusion module that flexibly incorporates the features from the two modals. First, we obtain s and q from the mean pooling of H_r and H_g, respectively. Then, cross-modal attention is applied between H_r and q, and between H_g and s:
Att(H_r, q) = Σ_i a_i h_i^r,  Att(H_g, s) = Σ_i b_i h_i^g,
where the attention scores a_i and b_i come from
a_i = softmax_i((W_a^1 h_i^r)^T (W_a^2 q)),  b_i = softmax_i((W_b^1 h_i^g)^T (W_b^2 s)),
and W_a^1, W_a^2, W_b^1 and W_b^2 are parameter matrices. The cross-modal attention here grounds the quantity information in the semantics modal, and vice versa. By applying different weights to the different modals, our model can flexibly pay more or less attention to a certain modal. Finally, the output of dynamic fusion is the concatenation
f = [Att(H_r, q); Att(H_g, s)]. (5)

Stack Multiple CMEncoders
Humans often need multiple glimpses to refine an MWP solution. Similarly, CMEncoders can be stacked to refine the understanding of an MWP over multiple steps, as shown in Figure 2. Given the outputs H_r^(k-1) and H_g^(k-1) of the semantic encoder and quantity encoder, and the fused feature f^(k-1) from the dynamic fusion module at layer k-1, the features are combined as
H_att^(k-1) = c_r H_r^(k-1) + c_g H_g^(k-1),
where the attention weights c_r and c_g are
c_r = σ(W_r^2 tanh(W_r^1 f^(k-1))),  c_g = σ(W_g^2 tanh(W_g^1 f^(k-1))),
and W_r^1, W_r^2, W_g^1 and W_g^2 are parameter matrices. H_att^(k-1) is then the input to both the semantic modal and the quantity modal of the k-th CMEncoder, which outputs f^(k) for the update at layer k+1.
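One stacking step can be sketched as follows. The two-projection sigmoid gating form is a reconstruction consistent with the four named parameter matrices, not a verified detail of the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stack_step(H_r, H_g, f, p):
    # Scalar gates computed from the previous layer's fused feature f decide
    # how much each modality contributes to the shared input H_att of the
    # next CMEncoder. The exact gating form is an assumption of this sketch.
    c_r = sigmoid(p["W2r"] @ np.tanh(p["W1r"] @ f))  # weight for semantics
    c_g = sigmoid(p["W2g"] @ np.tanh(p["W1g"] @ f))  # weight for quantities
    return c_r * H_r + c_g * H_g                     # H_att for layer k

rng = np.random.default_rng(2)
d, n = 6, 5
p = {"W1r": rng.standard_normal((d, 2 * d)), "W2r": rng.standard_normal((1, d)),
     "W1g": rng.standard_normal((d, 2 * d)), "W2g": rng.standard_normal((1, d))}
H_att = stack_step(rng.standard_normal((n, d)),  # H_r at layer k-1
                   rng.standard_normal((n, d)),  # H_g at layer k-1
                   rng.standard_normal(2 * d),   # fused feature f
                   p)
```

Since H_att keeps the per-token shape of H_r and H_g, it can be fed directly to both encoders of the next layer.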
After finishing the K-th reasoning step, we concatenate the final H_r^(K) and H_g^(K) as the final output representation H_final.

Decoder
We follow the same decoder implementation as GTS (Xie and Sun, 2019). Eventually, the decoder outputs the pre-order traversal sequence of the solution tree.
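To illustrate the output format, a pre-order (prefix) equation sequence can be turned back into an answer with a small recursive evaluator. This evaluator is our own illustrative sketch of the output convention, not GTS code.

```python
def eval_preorder(tokens):
    # Evaluate a pre-order (prefix) equation sequence, the output format of
    # the tree decoder: each operator is followed by its two sub-expressions.
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    pos = 0
    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in ops:
            left = parse()
            right = parse()
            return ops[tok](left, right)
        return float(tok)
    return parse()

# Solution tree for the Figure 1 story problem:
# 348 * 23 + (470 - 348) * 17, pre-order: + * 348 23 * - 470 348 17
answer = eval_preorder("+ * 348 23 * - 470 348 17".split())  # 10078.0
```

Reading the sequence left to right reconstructs the tree top-down, which is why the pre-order traversal is a convenient generation target.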

Training Method
Given training samples with problem description W and the corresponding solution S, the main training objective is to minimize the negative log probability of predicting S from W, empowered by the compositionality of the CMEncoders. The overall loss is
L = L_MWP + λ (||Embed_s(W)||_2 + ||Embed_q(W)||_2),
where L_MWP is the negative log prediction probability −log p(S | W), and the L_2 norms of the encoder embedding matrices are added as regularization terms with weight λ.
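A minimal sketch of this objective, assuming λ = 1 (the weight reported in the hyper-parameter search) and an unsquared L_2 norm; whether the norm is squared in the actual implementation is not stated in the paper.

```python
import numpy as np

def total_loss(log_p_solution, embed_s, embed_q, lam=1.0):
    # L = L_MWP + lam * (||Embed_s||_2 + ||Embed_q||_2)
    # log_p_solution: log p(S | W) for the gold solution sequence.
    l_mwp = -log_p_solution
    reg = np.linalg.norm(embed_s) + np.linalg.norm(embed_q)
    return l_mwp + lam * reg
```

With zero embeddings the regularizer vanishes and the loss reduces to the plain negative log likelihood, so the regularizer only penalizes embedding magnitude on top of the prediction objective.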

Experimental Results
As Table 1 shows, our approach outperforms all other RNN-based baselines in terms of answer accuracy. On Math23k, we outperform the latest RNN-based result of Wu et al. (2021a) by 1.8%. For the first time, an RNN-based MWP solver reaches over 80% answer accuracy on the Math23k dataset. Moreover, achieving the best performance with even fewer parameters suggests that our model is also memory-efficient.
PLM-based solvers benefit from pre-training on huge corpora and thus achieve great semantic understanding ability. From a different angle, our work aims to integrate semantic and quantity understanding effectively and efficiently. By incorporating the MWP-BERT model as our semantic extractor, the answer accuracy of C-MWP reaches state-of-the-art performance, which demonstrates the feasibility of combining features from the GNN encoder of Graph2Tree.
Performance on Different Types of MWP.
To investigate how our model performs across various types of MWP, we introduce a new split of Math23k covering three types of problems: story problems, algebra problems and knowledge problems. Split details are given in the appendix, and the evaluation results are presented in Table 3. Without a compositional design, Graph2Tree and Multi-E/D perform better than GTS on the story and algebra test problems but worse on the knowledge problems. As stated before, one joint modal cannot do it all: these baselines work well on some types of problems while performing weakly on others. Our C-MWP offers an across-the-board accuracy improvement, which firmly supports our motivation of alleviating the generalization issue. This provides clear evidence that our model leverages general mathematical knowledge across different types of MWP, successfully solving some non-trivial problems that Graph2Tree fails to solve.

Conclusion and Future Work
The semantic meaning and the quantity information are important intrinsic properties of a math word problem. Aiming to overcome uni-modal bias and achieve better generalization, we make the first attempt at a compositional MWP solver, C-MWP. Multi-layer reasoning and dedicated training methods are leveraged to enhance the generalizability of the model. As the method could be applied to a broader range of neuro-symbolic learning problems, we will keep exploring the adaptability of this compositional encoding approach.

Limitations
Explainability. Most current MWP solvers are only able to generate solutions. Although our work achieves better generalization ability, it remains hard to explain how the model solves MWPs, whether correctly or incorrectly. These automated solvers would be much more helpful for tutoring students if they could explain their equation solutions by generating reasoning steps.

Hyper-Parameter Tuning
In general, we apply grid search over a manually designed search space, using answer accuracy as the evaluation metric, to select the hyper-parameters.
For the number of stacked encoders, the search space is {1, 2, 3, 4} and we finally use 2. For the weight of the L_2 normalization loss, we choose 1 from {0.01, 0.1, 1, 5, 10}. The weight of the random noise, 0.2, is selected by grid search over the range 0 to 1 with step 0.1. We also tune the beam size of beam search over {3, 4, 5, 6, 7} and choose 5. The dropout probability 0.5 is selected from

Sensitivity Analysis about Stacking Number K
As mentioned in Section 2.2, our CMEncoder can be stacked into multiple layers to improve the representation of an MWP. To better understand the hyper-parameter K, i.e., the number of stacked CMEncoders, we conduct a sensitivity analysis in Table 4. The best K for the RNN encoder is 2, echoing how humans often need multiple glimpses to refine an MWP solution. Meanwhile, the best K for the PLM-based encoder is 1. A likely reason is that a pre-trained language model already has an outstanding ability to encode text, so applying another CMEncoder to refine the encoded features is unnecessary.
The first problem has 4 quantities, all of which are useful, which means it requires sufficient problem understanding and mathematical reasoning to generate the right answer. Both Graph2Tree and Multi-E/D, which directly connect the semantics modal and the quantity modal, fail to extract a clear representation of the problem, ultimately producing unreasonable solutions that contain only 3 quantities.
For the second problem ("The starting price of a taxi is 6 yuan; after 3 kilometers it costs an additional 1.2 yuan per kilometer; how many yuan should someone pay for a 5-kilometer taxi ride?"), although Graph2Tree and Multi-E/D utilize all 3 quantities in the problem description, they still fail to generate a plausible solution. These two cases show that our proposed encoder extracts more comprehensive representations from problem descriptions, eventually guiding the decoder to generate the correct solutions.

MWPs in Different Categories
Figure 1 shows MWP examples of "Story Problem", "Algebra Problem" and "Knowledge Problem". A "Story Problem" often includes a significant amount of background information such as characters, objectives and behaviors. An "Algebra Problem" involves math notation or is composed of elementary concepts. A "Knowledge Problem" asks for external knowledge such as geometry and number sequences. The category of each problem is determined by keywords; the keywords for "Story" and "Knowledge" problems are selected from the appendix of (Liu et al., 2019). Inspired by them, we categorize Math23k into 3 subsets: story, algebra and knowledge. The statistics of these problems are shown in Table 5.

Figure 1: Examples of different types of problems in MWP solving.

Table 1: The Math23k column shows the results when evaluating on the public test set of Math23k, while the Math23k* column shows the results of 5-fold cross-validation on the Math23k dataset. The last column, #E, denotes the number of parameters in the encoders.

Table 2: Accuracy of different ablated models.

Table 5: Statistics of the different types of problems in Math23k.
