Learning by Analogy: Diverse Questions Generation in Math Word Problem

Solving math word problems (MWPs) with AI techniques has recently made great progress with the success of deep neural networks (DNNs), but the task is far from solved. We argue that the ability of learning by analogy is essential for an MWP solver to better understand the same problem when it is formulated in diverse ways. However, most existing works exploit shortcut learning, training MWP solvers on samples that each contain only a single question. Lacking diverse questions, these methods merely learn shallow heuristics. In this paper, we make a first attempt to solve MWPs by generating diverse yet consistent questions/equations. Given a typical MWP including the scenario description, question, and equation (i.e., answer), we first generate multiple consistent equations via a group of heuristic rules. We then feed them to a question generator together with the scenario to obtain the corresponding diverse questions, forming a new MWP with a variety of questions and equations. Finally, we engage a data filter to remove unreasonable MWPs, keeping the high-quality augmented ones. To evaluate the ability of learning by analogy for an MWP solver, we generate a new MWP dataset (called DiverseMath23K) with diverse questions by extending the current benchmark Math23K. Extensive experimental results demonstrate that our proposed method can generate high-quality diverse questions with corresponding equations, further leading to performance improvement on DiverseMath23K. The code and dataset are available at: https://github.com/zhouzihao501/DiverseMWP


Introduction
Solving Math Word Problem (MWP) aims to infer a mathematical equation and final answer from the natural language description of a math problem. Table 1(a) shows one typical MWP example. In this task, the machine needs to extract relevant information from natural language texts and perform mathematical reasoning, which is challenging. With the boom of deep neural networks (DNN), research on solving MWPs has recently made great progress. For example, Seq2Seq models (Wang et al., 2017; Xie and Sun, 2019; Zhang et al., 2020a) as well as pre-trained language models (PLMs) (Tan et al., 2021; Li et al., 2022b; Liang et al., 2022) have been extensively exploited to deal with MWPs, and have increased the prediction accuracy significantly. However, such models usually lack the ability of learning by analogy due to the limited data size and problem diversity. Therefore, current approaches have unfortunately reached their performance bottleneck (Zhang et al., 2019; Patel et al., 2021; Liu et al., 2021a; Sundaram et al., 2022), showing that much remains to be done.
To alleviate this limitation, recent focus has been put on how to augment high-quality data for MWPs. Along this line, there have been several proposals (Jen et al., 2021; Kumar et al., 2021; Liu et al., 2021a; Li et al., 2022a; Kumar et al., 2022). Though demonstrating encouraging results, these practices only consider word-level or sentence-level alternative expressions of the original problem, owing to the rigorous requirements on logic and numerical quantities. As illustrated in Table 1(b), the back translation augmentation method (Kumar et al., 2022) generates less diverse data that share very limited semantic differences with the original counterpart. On the other hand, Yang et al. (2022) publish a diverse MWP dataset (called UnbiasedMWP), which was collected by manual annotation at huge cost and is limited in size.
In this paper, we make a first attempt to solve MWPs by automatically generating multiple diverse yet consistent questions (together with their corresponding equations), as illustrated in Table 1(c). There are two main reasons for this augmentation strategy. (1) Training on less diverse data would lead the solver to learn shallow heuristics only, whilst deep semantics are preferred in order to better understand the problems (Patel et al., 2021; Li et al., 2022b; Yang et al., 2022). Consequently, when the question is changed (i.e., Question1, Question2, and Question3 in Table 1(c)), the learned solver may not be able to solve the MWP properly. (2) Our augmentation strategy can generate challenging and diverse MWPs. Training on such data would improve the ability of learning by analogy, which is essential for an MWP solver to deeply understand the problem. It also helps to reduce the unreasonable case (Patel et al., 2021) in which some current solvers can still predict the equation even without any question (e.g., after removing the question in the text of Table 1(a)).
Motivated by these findings, we propose a Diverse Questions Generation Framework (DQGF) to generate high-quality and diverse questions with their corresponding equations for a given MWP. Our DQGF consists of three components, as shown in Figure 1. (1) Diverse Equations Generator: It generates diverse and meaningful equations from the original MWP based on two generation strategies. Specifically, we propose a sub-equation based strategy that extracts sub-equations from the original equation, and a unit based strategy that generates equations according to the units (e.g., "dollars" in Table 1) in the scenario description.
(2) Equation-aware Question Generator: Given a scenario description and a generated equation, it generates a corresponding question. For example, given the scenario description and Equation1 in Table 1(c), it can generate Question1. In detail, we utilize two encoders to extract the information of the scenario description and the equation respectively, and design an interaction mechanism that exploits numbers as a bridge to fuse the information of both encoders. (3) Data Filter: A large-scale MWP pre-trained language model (Liang et al., 2022) is leveraged to filter out unreasonable data. As such, we can generate many high-quality and diverse MWP samples.
Extensive experiments on the existing dataset UnbiasedMWP (Yang et al., 2022) show that our proposed DQGF can generate high-quality diverse questions with corresponding equations, thus increasing the accuracy of the MWP solver. To further verify the effectiveness of DQGF, we produce a new dataset (called DiverseMath23K) with diverse questions from the current benchmark dataset Math23K (Wang et al., 2017). We also propose a new Group-Accuracy metric computed over all questions of a problem. Experimental results show that DQGF can effectively improve the overall performance of the solver on DiverseMath23K, demonstrating its ability of learning by analogy. In summary, our contributions are as follows:
• We propose a novel diverse questions generation framework (DQGF) to automatically generate diverse questions with their corresponding equations for a given MWP. To the best of our knowledge, this is the first effort to generate such data in MWP.
• We propose a Diverse Equations Generator, consisting of a sub-equation based strategy and a unit based strategy, to generate diverse and meaningful equations from the original MWP.
• We propose an Equation-aware Question Generator to generate a question from the given scenario and equation. It consists of two encoders that encode the scenario and the equation respectively, together with an interaction mechanism that fuses their information.
• We produce a new MWP dataset (called DiverseMath23K) with diverse questions by extending the current benchmark Math23K.
• Experimental results demonstrate that DQGF can generate high-quality diverse questions and effectively improve the overall performance of the MWP solver on both UnbiasedMWP and DiverseMath23K.

Related Work
MWP Solver: Recent proposals intend to solve the problem by using sequence or tree generation models. Wang et al. (2017) present a sequence-to-sequence (seq2seq) approach to generate the mathematical equation. Xie and Sun (2019) propose a goal-driven tree-structured (GTS) model to generate the equation tree. This sequence-to-tree approach significantly improves the performance over the traditional seq2seq approaches. Zhang et al. (2020a) adopt a graph-to-tree approach to model the quantity relations using graph convolutional networks (GCN). Applying pre-trained language models such as BERT (Devlin et al., 2019) was shown to benefit the tree expression generation substantially. A prior study (Patel et al., 2021) indicates that existing MWP solvers rely on shallow heuristics to generate equations. As such, they cannot solve different questions of the same MWP well and may even ignore the question. Our DQGF effectively helps the solver overcome these issues.
MWP Generation: MWP generation approaches can be divided into three categories: template-based approaches, rewriting-based approaches, and neural network-based approaches.
Template-based approaches usually follow a similar two-stage process: they first generalize an existing problem into a template or a skeleton and then generate the MWP sentences from the templates (Williams, 2011; Polozov et al., 2015). Rewriting-based approaches target the MWP generation problem by editing existing human-written MWP sentences to change their theme but not the underlying story (Koncel-Kedziorski et al., 2016; Moon-Rembert and Gilbert, 2019).
Recent attempts have focused on exploiting neural network-based approaches that generate MWPs from equations and topics in an end-to-end manner (Liyanage and Ranathunga, 2020; Liu et al., 2021b; Wang et al., 2021). Unlike these generation methods, our equation-aware question generator focuses on generating questions that are in line with the given scenario and match the given equation. Recently, Shridhar et al. (2022) have also proposed a generation model to implement this function, but there are two main differences: (1) Their work focuses on generating goal-driven sub-questions without equations, which are used in prompt learning rather than as a general data augmentation tool.
(2) Their generator directly concatenates the scenario and equation text sequences to encode and fuse their information, although the structure of an equation is very different from that of the scenario text. We instead propose two different encoders, where an interaction mechanism is designed to leverage numbers as a bridge to fuse the information.
MWP Dataset: Several datasets have been proposed to evaluate a model's numerical reasoning ability (Koncel-Kedziorski et al., 2016; Wang et al., 2017; Amini et al., 2019; Miao et al., 2020). They only provide a single question for each scenario. Therefore, training and evaluating in such a setting leads solvers to rely on shallow heuristics to generate equations (Patel et al., 2021). To mitigate this learning bias, Yang et al. (2022) propose a diverse MWP dataset (called UnbiasedMWP). However, manually collecting high-quality datasets is usually labor-intensive and time-consuming in practice. In contrast, our DQGF can automatically generate such diverse data. In this paper, we use UnbiasedMWP to train the equation-aware question generator and to evaluate the whole DQGF.
Besides, we also propose a diverse MWP dataset, DiverseMath23K, to evaluate the MWP solver.

Methodology
Figure 1 shows the overview of the proposed Diverse Questions Generation Framework (DQGF). We first put the original MWP into the Diverse Equations Generator to generate diverse equations. Each generated equation and the scenario description of the original MWP are then fed into the trained Equation-aware Question Generator to produce the corresponding question. In this way, we obtain diverse questions with their equations, forming new candidate MWPs. Finally, these candidate MWPs are further filtered by the Data Filter. In what follows, we introduce the Diverse Equations Generator, the Equation-aware Question Generator, and the Data Filter in Section 3.1, Section 3.2, and Section 3.3, respectively.

Diverse Equations Generator
The Diverse Equations Generator aims to generate diverse equations from the original MWP. Our principle is to generate as many logical equations as possible. Motivated by this, we propose two equation generation strategies: a sub-equation based strategy and a unit based strategy.

Sub-equation Based
The equation of the original MWP usually includes some sub-equations, which represent the necessary steps to solve the problem (Cobbe et al., 2021). For instance, in Table 1(c), "15+10" is a sub-equation of the original equation, describing a uniform's price. Therefore, we extract these sub-equations from the original equation, which yields high-quality and diverse equations.
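The extraction step can be illustrated with a minimal sketch. It assumes equations are plain infix strings over +, -, *, / and uses Python's standard `ast` module to enumerate sub-expressions; the function name `sub_equations` and the example equation "40*(15+10)" are illustrative assumptions, not taken from the paper's code or from Table 1.

```python
import ast


def sub_equations(equation: str) -> list[str]:
    """Return every proper binary sub-expression of an infix equation.

    Hypothetical helper: the paper does not specify how sub-equations
    are extracted, so this simply walks the parsed expression tree.
    """
    root = ast.parse(equation, mode="eval").body
    found = []
    for node in ast.walk(root):
        # Keep operator nodes only, and skip the full original equation.
        if isinstance(node, ast.BinOp) and node is not root:
            text = ast.unparse(node)  # requires Python 3.9+
            if text not in found:
                found.append(text)
    return found


if __name__ == "__main__":
    # Illustrative equation (assumed, not copied from Table 1).
    print(sub_equations("40*(15+10)"))
```

Numbers on their own are skipped here because a single number does not correspond to a new question; whether the original implementation applies further constraints is not stated in the text.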
Unit Based
There are physical relations between the numbers in an MWP. We can identify these relations and then combine numbers with operators to get a new equation. Motivated by this, we propose to search for the relations between numbers based on their units. Every number in an MWP has its unit. For example, in Table 1, "40" has the unit "students" and "15" has the unit "dollars". We combine numbers in two situations.
(1) Same unit: Two numbers with the same unit always represent the same object. We combine them with the operator "+" to generate equations representing totality questions like "what is the total of A and B". Besides, we combine them with "-" and "/", which represent comparison questions like "how much more is A than B" and "how many times is A as large as B", respectively.
(2) Different units: Two numbers with different units in an MWP always represent two objects that have subordinate relations. Therefore, we combine them with "*". This strategy generates diverse equations, though it may also produce some unreasonable equations that in turn yield noisy MWPs. Such noisy MWPs are removed by the final data filter.
Note that both the sub-equation based and the unit based strategies rely on heuristic rules. Therefore, we do not need to train our Diverse Equations Generator.
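The two unit-based rules can be sketched as follows. This is a minimal illustration that assumes each number has already been paired with its unit; how units are detected is not described in the text, so the input format, the function name `unit_based_equations`, and the unit assigned to "10" are assumptions.

```python
from itertools import combinations


def unit_based_equations(quantities):
    """Combine number pairs according to their units.

    quantities: list of (number_string, unit) pairs (assumed input format).
    Same unit      -> "+", "-", "/" (totality and comparison questions).
    Different unit -> "*" (subordinate relation).
    """
    equations = []
    for (a, unit_a), (b, unit_b) in combinations(quantities, 2):
        if unit_a == unit_b:
            equations += [f"{a}+{b}", f"{a}-{b}", f"{a}/{b}"]
        else:
            equations.append(f"{a}*{b}")
    return equations


if __name__ == "__main__":
    # "40 students" and "15 dollars" follow the text; "10 dollars" is assumed.
    numbers = [("40", "students"), ("15", "dollars"), ("10", "dollars")]
    for equation in unit_based_equations(numbers):
        print(equation)
```

The sketch emits only one operand order for "-" and "/"; the text does not say whether both orders are generated, so that choice is left open here.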

Equation-aware Question Generator
General question generation in the Question-Answering area aims to generate a question from a given passage and a specified answer (Sun et al., 2018; Kim et al., 2019; Li et al., 2019). By regarding the scenario description and equation as passage and answer respectively, we can formulate our task as a general question generation problem. Based on this, we propose an equation-aware question generator under a general encoder-decoder framework as shown in Figure 2. Specifically, we utilize two encoders to extract the information of the scenario description and the equation respectively, and design an interaction mechanism that exploits numbers as a bridge to fuse the information of both encoders.

Scenario Encoder
We adopt a pre-trained language model BERT (Devlin et al., 2019) as our scenario encoder. The unsupervised pre-training on large corpora makes the model capture linguistic knowledge, which provides rich textual representations. We represent the scenario $S$ as a sequence of $T$ tokens, $S = [s_1, s_2, \ldots, s_T]$, and formulate the encoding process as

$$h^s_i = \mathrm{BERT}(s_i), \quad i = 1, \ldots, T, \tag{1}$$

where $h^s_i$ represents the embedding of token $s_i$ from the encoder. Finally, the representation of the scenario can be written as $H^s$:

$$H^s = [h^s_1, h^s_2, \ldots, h^s_T]. \tag{2}$$

Equation Encoder
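The sequence form cannot model the structure of the equation well (Xie and Sun, 2019). Hence we transform it into an equation tree, which is then encoded by a TreeLSTM (Tai et al., 2015). The equation is transformed into a binary tree representation as proposed in (Xie and Sun, 2019) and sequentialized by its pre-order traversal. Thus the equation can be represented as $E = [e_1, e_2, \ldots, e_n]$, where $n$ is the length of the pre-order equation and a node $e_i$ represents a number or an operator (+, -, *, /). In detail, we first adopt a BERT to encode each node:

$$x_i = \mathrm{BERT}(e_i), \tag{3}$$

where $x_i$ denotes the embedding of node $e_i$. Then, we encode the equation tree by a TreeLSTM:

$$h^e_i = \mathrm{TreeLSTM}\big(x_i, \{h^e_j\}_{j \in C(i)}\big), \tag{4}$$

where $C(i)$ represents the index set of child nodes of $e_i$. Finally, the representation of the equation can be written as $H^e$:

$$H^e = [h^e_1, h^e_2, \ldots, h^e_n]. \tag{5}$$

Interaction Mechanism
In order to generate a question based on both the scenario and the equation, the interaction between them is crucial. Inspired by iterative deep learning (He and Schomaker, 2019; Schick and Schütze, 2021), we propose an interaction mechanism which uses numbers as a bridge to fuse the information of both scenario and equation. It consists of the following two processes.
Scenario to Equation: After BERT encodes the whole scenario text, each token's embedding carries the scenario's context information. For a number appearing in both the scenario and the equation, we replace its embedding in Equation (3) with its embedding in Equation (1). In this way, the scenario's context information is brought into the equation.
Equation to Scenario: After bringing the information of the scenario into the equation and encoding the equation tree, we put the embedding of each number in the equation back into the scenario representation. In detail, we replace its embedding in Equation (1) with its embedding in Equation (4).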
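The two-way embedding exchange can be sketched as below. This is only an illustration under stated assumptions: embeddings are treated as opaque objects (strings in the demo instead of vectors), `tree_encode` is a hypothetical stand-in for the TreeLSTM, each number is assumed to occur once in the scenario, and none of the names come from the authors' code.

```python
def is_number(token):
    """Return True if the token parses as a number."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def interact(scenario_tokens, scenario_emb, equation_nodes, node_emb, tree_encode):
    """Fuse scenario and equation representations through shared numbers."""
    position = {tok: i for i, tok in enumerate(scenario_tokens)}

    # Scenario to Equation: a number node takes its scenario embedding.
    bridged = [
        scenario_emb[position[node]] if is_number(node) and node in position else emb
        for node, emb in zip(equation_nodes, node_emb)
    ]

    # Encode the equation tree (stand-in for the TreeLSTM step).
    encoded = tree_encode(equation_nodes, bridged)

    # Equation to Scenario: write the encoded number embeddings back.
    fused = list(scenario_emb)
    for node, emb in zip(equation_nodes, encoded):
        if is_number(node) and node in position:
            fused[position[node]] = emb
    return fused, encoded


if __name__ == "__main__":
    tokens = ["40", "students", "15"]
    s_emb = ["S0", "S1", "S2"]            # placeholder scenario embeddings
    nodes = ["*", "40", "15"]             # pre-order equation
    n_emb = ["E*", "E40", "E15"]          # placeholder node embeddings
    toy_tree = lambda ns, embs: [e + "'" for e in embs]  # marks "encoded"
    print(interact(tokens, s_emb, nodes, n_emb, toy_tree))
```

In the demo, the primes show which scenario positions end up carrying equation-side information after the round trip.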
Decoder
We adopt the pre-trained language model BertGeneration (Rothe et al., 2020) as our decoder. Representing a question $Q$ as a sequence of $m$ tokens, $Q = [q_1, q_2, \ldots, q_m]$, the token $q_i$ is generated as

$$q_i = \mathrm{BertGeneration}(H, q_1, \ldots, q_{i-1}), \tag{6}$$

where $H$ is the final representation of the scenario and equation, obtained by concatenating $H^s$ and $H^e$ as

$$H = [H^s; H^e]. \tag{7}$$

Note that all of the pre-trained models in both the encoders and the decoder are fine-tuned on the MWP dataset.

Data Filter
Filtering out detrimental augmented data can improve the quality of the data as well as the downstream performance (Le Bras et al., 2020). However, human filtering would be very costly due to the large size of our augmented data. Therefore, we utilize an existing powerful MWP solver as an expert model to judge whether the predicted answer is the same as the ground truth (Axelrod et al., 2011; Xie et al., 2021). Inspired by Ou et al. (2022), we leverage a large-scale MWP pre-trained language model, MWP-BERT (Liang et al., 2022), as our expert model, utilizing its powerful generalization ability.
Considering that our generated MWPs contain many new diverse questions, it is difficult for an existing solver to predict the answer accurately, resulting in many false filtering cases. To increase the recall on the generated samples, we apply a beam-search strategy on the expert model to select the top k predicted equations (we set k = 5 in our experiments).
Since the final answer can be reached by different solutions (Yang et al., 2022), we compare the answers calculated from the equations instead of comparing the equations directly. An augmented MWP passes the filter if its final answer is equal to the answer of at least one of the top k equations predicted by the expert model.
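The answer-level comparison can be sketched as follows. The sketch assumes the expert model's beam-search output is already available as a ranked list of equation strings; the helper names `evaluate` and `passes_filter` and the numerical tolerance are illustrative assumptions, and no actual MWP-BERT call is shown.

```python
import ast
import operator

# Supported binary operators (the four used in the paper's equations).
OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(equation):
    """Safely evaluate an infix equation; return None if it is invalid."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    try:
        return walk(ast.parse(equation, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return None


def passes_filter(generated_equation, expert_equations, k=5, tol=1e-4):
    """Keep a generated MWP if any top-k expert answer matches its answer."""
    target = evaluate(generated_equation)
    if target is None:
        return False
    for candidate in expert_equations[:k]:
        value = evaluate(candidate)
        if value is not None and abs(value - target) <= tol:
            return True
    return False


if __name__ == "__main__":
    # Different equations, same answer: the sample is kept.
    print(passes_filter("40*(15+10)", ["40*15", "40*15+40*10"]))
```

Comparing answers rather than equation strings is what lets the second candidate match even though its surface form differs from the generated equation.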

Dataset and experimental setting
Dataset: We conduct experiments on an existing diverse questions dataset, UnbiasedMWP (Yang et al., 2022), which is split into 2,507, 200, and 200 MWP groups for training, validation, and testing, respectively. Each group contains one original MWP and 1 to 8 additional diverse questions and equations with the same scenario. In total, it has 8,895, 684, and 685 MWPs for training, validation, and testing, respectively. In this paper, we train our Equation-aware Question Generator and evaluate the whole DQGF on it.
Evaluation Metrics: For the whole DQGF, we use the accuracy of an MWP solver to evaluate the quality of the generated data. Without loss of generality, we choose GTS (Xie and Sun, 2019) with BERTEncoder (Devlin et al., 2019) as the MWP solver. Furthermore, we propose a Group-Accuracy metric that considers the prediction accuracy on all diverse questions in an MWP. For example, in Table 1(c), the normal accuracy simply regards the MWP as three samples by evaluating each question separately, while our Group-Accuracy considers it as only one sample, and the prediction is correct only if all three equations are predicted correctly.
Compared to the common accuracy, the proposed Group-Accuracy can evaluate whether a solver truly understands an MWP with the ability of learning by analogy. For the equation-aware question generator, we report BLEU (Papineni et al., 2002), ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004), which are based on exact word overlap. The BERT F1 score (Zhang et al., 2020b), which is based on DeBERTa (He et al., 2021), is also used.
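The difference between the common accuracy and Group-Accuracy can be made concrete with a short sketch. It assumes per-question correctness has already been computed and that each question carries an identifier for its scenario group; the record format and the function name are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_and_group_accuracy(records):
    """Compute (accuracy, group accuracy) from per-question results.

    records: non-empty iterable of (group_id, is_correct) pairs, where
    group_id identifies the shared scenario (assumed input format).
    """
    groups = defaultdict(list)
    for group_id, is_correct in records:
        groups[group_id].append(bool(is_correct))

    total = sum(len(results) for results in groups.values())
    accuracy = sum(sum(results) for results in groups.values()) / total
    # A group counts as correct only if every question in it is correct.
    group_accuracy = sum(all(results) for results in groups.values()) / len(groups)
    return accuracy, group_accuracy


if __name__ == "__main__":
    results = [("a", True), ("a", True), ("a", False), ("b", True)]
    print(accuracy_and_group_accuracy(results))
```

In the demo, three of four questions are solved, but only one of the two groups is fully solved, so Group-Accuracy is the stricter of the two numbers.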

Experimental Results
We evaluate the quality of the generated data by the results of a common MWP solver on both accuracy and Group-Accuracy. In detail, we train the MWP solver on three different sets of data: the original data of each group in UnbiasedMWP (called Unbiased-source), our generated MWP data from UnbiasedMWP (called Unbiased-DQGF), and the ground-truth MWPs in UnbiasedMWP (called Unbiased-GT). Notably, Unbiased-source only has MWPs with a single question, while the latter two have MWPs with diverse questions. Since Unbiased-GT directly uses the annotated diverse questions, its performance can be regarded as the upper bound of the generation method. The results are shown in Table 2.
As shown in Table 2, training on the data augmented by DQGF significantly improves the accuracy of the solver from 34.9% to 62.7%. This indicates that DQGF can generate high-quality MWP samples, which are useful for training a solver. In addition, the Group-Accuracy also increases substantially from 29.5% to 42%, even higher than the common accuracy (34.9%) of Unbiased-source, showing that our method can generate MWP samples with valid diverse questions that help the solver better understand the problem by acquiring the ability of learning by analogy. Comparing Unbiased-DQGF and Unbiased-GT, we can see that there is still a gap between our method and manually labelled data. Manual annotation can produce more diverse and completely correct data, which leads to better performance.

Fine-grained Analysis
In this section, we will show the performance of the three components in DQGF individually.
Diverse Equations Generator: Table 3 shows the comparison among different equation generation strategies. As observed, each strategy can generate high-quality and meaningful diverse equations. Concretely, the same-unit based generation strategy brings the most benefit to DQGF because it generates many meaningful equations with little noise. The sub-equation based strategy and the different-units based strategy can also effectively generate meaningful equations, but bring little improvement to the solver. There are two reasons: 1) the sub-equation based strategy cannot generate enough equations, since the sub-equations in the original equation are limited; and 2) the different-units based strategy generates meaningful equations but also brings many noisy equations, which are hard to filter out completely.
Equation-aware Question Generator: We compare against a baseline method that directly concatenates the scenario and equation text sequences (Shridhar et al., 2022), using BERT (Devlin et al., 2019) as the encoder and BertGeneration (Rothe et al., 2020) as the decoder. Table 4 reports the comparison of the different question generator models. We can see that EQG(w/o)IM improves over the baseline method. This indicates that the separate scenario encoder and equation encoder encode the structure of the scenario and the equation better than directly encoding their concatenated sequence. Integrating the interaction mechanism (IM) leads to a further large improvement, achieving the best performance on every metric, which demonstrates that our interaction mechanism can fuse the information of scenario and equation well.
Specifically, the BLEU score is 60.5%, which is not high; this is, however, explainable, as BLEU measures text overlap. As observed, though semantically identical, some of our generated questions overlap less with the ground truth. This is also reflected by the higher BERT F1 score, which measures semantic similarity.
Data Filter: We examine the effect of the beam size k of the filter in DQGF, as shown in Figure 3. The experimental results show that DQGF obtains the best performance when k is 5. DQGF achieves good performance when k is between 4 and 6, which appears to be a suitable interval in which many correct candidates can pass the filter. When k is between 1 and 3, filtering is still accurate but some correct data are filtered out. Therefore, this interval achieves competitive but not the best performance. When k is between 7 and 8, the filtering is inaccurate: some noisy data pass the filter, which harms the final data quality.

New MWP dataset DiverseMath23K
We

Results
We compare the performance of the solver trained on the original Math23K and on DiverseMath23K. In addition to the accuracy and Group-Accuracy, we report the Deq-Accuracy (Patel et al., 2021), which is a metric measuring question sensitivity. Concretely, it measures the accuracy with which the solver predicts the answer of an MWP when the question is deleted (i.e., only the scenario is given as input). A better solver should have higher question sensitivity, so a lower Deq-Accuracy is expected.
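The Deq-Accuracy computation can be sketched in a few lines. The sketch assumes each test problem is stored with its scenario text and gold answer, and that the solver is exposed as a callable from text to a numeric answer; the field names, the callable interface, and the tolerance are illustrative assumptions.

```python
def deq_accuracy(problems, solve, tol=1e-4):
    """Accuracy when the solver sees only the scenario (question deleted).

    problems: non-empty list of dicts with 'scenario' and 'answer' keys
              (assumed format).
    solve:    callable mapping input text to a numeric answer
              (hypothetical solver interface).
    """
    hits = sum(
        1 for problem in problems
        if abs(solve(problem["scenario"]) - problem["answer"]) <= tol
    )
    return hits / len(problems)


if __name__ == "__main__":
    toy_problems = [
        {"scenario": "scenario one", "answer": 5.0},
        {"scenario": "scenario two", "answer": 7.0},
    ]
    # A toy solver that ignores its input and always answers 5.
    print(deq_accuracy(toy_problems, lambda text: 5.0))
```

A solver that still scores highly under this metric is answering without reading the question, which is the behaviour the metric is meant to expose.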
The results are shown in Table 5. We can see that the accuracy improves from 63.6% to 68.4%, and Group-Accuracy is boosted from 56.9% to 60.2%. These results indicate that DiverseMath23K enables the model to better understand MWPs and improves its ability to solve different questions in the same scenario.

In the future, we will focus on optimizing the model in the solver to improve its ability of learning by analogy and increase the Group-Accuracy on MWPs with diverse questions.

Limitations
Our DQGF still has some limitations. While our generated data improves performance in the diverse questions setting, there is still some noise in the generated data that affects the performance on the original single question. In the following, we describe the limitations of DQGF in each of its three components.
The diversity of the questions depends on the diversity of the equations. Our equation generator is based on heuristic rules, so the generated equations are very simple. In the future, we will try a model-based equation generator to generate more diverse equations. The question generator can only recognise equations with the operators +, -, *, and / due to the limited operator set in our training dataset UnbiasedMWP. In the future, we will expand the operator set so that the generation model can recognise more operators and be more universal. The filtering strategy is also important. Using the answers of the expert model as the evaluation criterion is still biased and lets some noisy data through. In fact, we have tried to generate more diverse equations, but all of them were removed by the current data filter. We will look for better filtering strategies in the future.

Figure 1: An overview of DQGF. Each generated equation from the Diverse Equations Generator and the scenario description of the original MWP are fed into the trained Equation-aware Question Generator to generate the corresponding question. In this way, we obtain diverse questions with their equations and form a new MWP. Finally, the candidate MWPs are further filtered using the Data Filter.

Figure 2: Equation-aware Question Generator.

Figure 3: Different beam size k of the expert model in the Data Filter.

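Table 1: Examples of math word problem (MWP) generation by different methods. (a) Original MWP, (b) MWP generated by the back translation method (Kumar et al., 2022), (c) MWP with diverse questions generated by our method. The questions are highlighted in red in the texts of (a) and (b).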

Table 3: Comparison of different equation generation strategies.

Table 4: Comparison of the different question generator models. The baseline directly concatenates the scenario and equation text sequences. EQG means the Equation-aware Question Generator, while EQG(w/o)IM means removing the Interaction Mechanism.

Table 5: Performance of solvers trained on different data. Ori and DQGF mean the original Math23K and DiverseMath23K, respectively.

Table 6
Prediction Results Analysis: Table 7 reports the prediction results of solvers trained on different data. The solver trained on the original Math23K can correctly solve Question1, which has a similar MWP in the training set. However, it cannot solve Question2, which is simpler than Question1. Moreover, it cannot solve other questions like Question3. This indicates that the solver merely learns shallow heuristics and fails to understand the MWP. When trained on DiverseMath23K, the solver gains the ability of learning by analogy, i.e., it can solve different questions even if the question is changed (see Question2 and Question3).

Conclusion and Future Work
In this paper, we explore the ability of learning by analogy for MWP solvers. To do this, we propose a diverse questions generation framework (DQGF) to automatically generate diverse questions with their corresponding equations for a given MWP, which consists of a Diverse Equations Generator, an Equation-aware Question Generator, and a Data Filter. Based on the trained DQGF, we further produce a new MWP dataset (DiverseMath23K) with diverse questions. Experimental results demonstrate that DQGF can generate high-quality diverse questions and effectively improve the overall performance of the MWP solver.