Mathematical Word Problem Generation from Commonsense Knowledge Graph and Equations

There is an increasing interest in the use of mathematical word problem (MWP) generation in educational assessment. Different from standard natural question generation, MWP generation needs to maintain the underlying mathematical operations between quantities and variables, while at the same time ensuring the relevance between the output and the given topic. To address this problem, we develop an end-to-end neural model to generate diverse MWPs in real-world scenarios from a commonsense knowledge graph and equations. The proposed model (1) learns representations from edge-enhanced Levi graphs of both symbolic equations and commonsense knowledge; and (2) automatically fuses equation and commonsense knowledge information via a self-planning module when generating the MWPs. Experiments on an educational gold-standard set and a large-scale generated MWP set show that our approach is superior on the MWP generation task, and it outperforms the SOTA models in terms of both automatic evaluation metrics, i.e., BLEU-4, ROUGE-L, Self-BLEU, and human evaluation metrics, i.e., equation relevance, topic relevance, and language coherence. To encourage reproducible results, we make our code and MWP dataset publicly available at https://github.com/tal-ai/MaKE_EMNLP2021.


Introduction
A mathematical word problem (MWP) is a coherent narrative that provides clues to the underlying correct mathematical equations and operations between variables and numerical quantities (Cetintas et al., 2010; Moyer et al., 1984). MWPs challenge students across a wide range of skills, such as literacy skills for understanding the question and analytical skills for recognizing the problem type and applying arithmetical operators. Table 1 shows one such problem, where students are asked to infer the counts of chickens and rabbits.
* The corresponding author: Zitao Liu

Math Word Problem: Chickens and rabbits were in the yard. Together they had 27 heads and 86 legs. How many rabbits were in the yard?
Equations: x + y = 27; 2x + 4y = 86
Solutions: x = 11; y = 16

In this paper, our objective is to automatically generate well-formed MWPs. Such automation will not only reduce the teachers' burden of manually designing MWPs, but also provide students with a sufficiently large number of practice exercises, which helps students avoid rote memorization (Williams, 2011; Wang and Su, 2016).
A large spectrum of models have been developed and successfully applied in a broad area of natural question generation (NQG) (Pan et al., 2019; Li et al., 2018; Sun et al., 2018; Zhang and Bansal, 2019; Kurdi et al., 2020; Guan et al., 2021a,b) and there has been a recent movement from the NQG community towards automatic generation of MWPs (Koncel-Kedziorski et al., 2016a; Polozov et al., 2015; Zhou and Huang, 2019). For example, Koncel-Kedziorski et al. (2016a) proposed a two-stage rewriting approach to edit existing human-authored MWPs. Polozov et al. (2015) conducted the MWP generation as a constrained synthesis of labeled logical graphs that represent abstract plots.
In general, there exists a large number of NQG models representing various text data and their syntax and semantics (Pan et al., 2019). However, automatic generation of MWPs still presents numerous challenges that come from special characteristics of real-world educational scenarios: (1) MWP generation models need not only to generate fluent sentences but also to understand the mathematical variables, numerical quantities, operations, and their relations. Moreover, the models are expected to generalize to unseen equations. (2) Multiple studies have found that MWPs with real-life plots help conceptual knowledge understanding, discourse comprehension and children engagement (Carpenter et al., 1980). (3) Computerized educational assessment systems require diverse MWP results even given similar input equations, which helps prevent students from rote memorization (Deane and Sheehan, 2003).
To overcome the above challenges, in this paper, we present a novel neural generation model MaKE (short for Mathematical word problem generation from commonsense Knowledge and Equations), which aims to automatically generate coherent and diverse MWPs from given equations in students' real-life scenarios. More specifically, to fully capture the mathematical variables, numerical quantities, operations, and their relations, equations are transformed into an edge-enhanced Levi graph. We adopt gated graph neural networks (GGNNs) to learn representative embeddings from the equation based symbolic Levi graph. Meanwhile, the same procedure is applied to the external commonsense based knowledge graph (CSKG), which helps generate topic-relevant and semantically valid sentences in real-life settings. We choose to use the conditional variational autoencoder (VAE) framework to generate MWPs from diversity-promoting latent states. Furthermore, in the decoding stage, we develop a self-planning module to dynamically select and fuse information from both equations and commonsense knowledge, which improves the syntactic structure of the generated MWP sentences. Overall, this paper makes the following contributions:
• We propose a GGNN based conditional VAE model for MWP generation. To the best of our knowledge, we are the first to introduce the combined architecture of GGNN and conditional VAE for MWP generation.
• We design a novel self-planning decoding module that fuses information from equations and commonsense knowledge with an implicit schedule, which helps generate semantically valid MWPs.
• The proposed model achieves the SOTA scores and outperforms existing methods by a significant margin on real-world educational MWP datasets in terms of both automatic and human evaluation metrics.
Related Work

Natural Question Generation
Previous research has directly approached the task of automatically generating questions for many useful applications such as augmenting data for the QA tasks (Li et al., 2018; Sun et al., 2018; Zhang and Bansal, 2019), helping semantic parsing (Guo et al., 2018) and machine reading comprehension (Yu et al., 2020; Yuan et al., 2017), improving conversation quality (Mostafazadeh et al., 2016; Dong et al., 2019), and providing student exercises for education purposes (Koncel-Kedziorski et al., 2016a). The NQG methods developed so far can be divided into two categories: heuristic based approaches and neural network based approaches (Pan et al., 2019; Kurdi et al., 2020). The former generate questions in two stages: they first obtain intermediate symbolic representations and then construct the natural language questions by either rearranging the surface form of the input sentence or generating with pre-defined question templates. The latter view the NQG task as a sequence-to-sequence (seq2seq) learning problem and jointly learn the generation process in an end-to-end manner (Yao et al., 2018; Zhou et al., 2018).

Math Word Problem Generation
Different from standard NQG tasks, generating MWPs not only needs the syntax, semantics and coherence of the output narratives, but also requires an understanding of the underlying symbolic representations and the arithmetic relationship between quantities. In general, MWP generation approaches can be divided into three categories: (1) template based approaches; (2) rewriting based approaches; and (3) neural network based approaches.
Template based approaches usually follow a similar two-stage process: they first generalize an existing problem into a template or a skeleton, and then generate the MWP sentences from the templates (Williams, 2011; Polozov et al., 2015; Bekele, 2020). Deane and Sheehan (2003) used semantic frames to capture both scene stereotypical expectations and semantic relationships among words and utilized a variant of second-order predicate logic to generate MWPs. Wang and Su (2016) leveraged the binary expression tree to represent the story of the MWP narrative and composed the natural language story recursively via a bottom-up tree traversal. Template based approaches rely heavily on tedious and limited hand-crafted templates, leading to very similar generated results. This cannot meet the demand for a large number of high-quality and diverse MWPs.
Rewriting based approaches target the MWP generation problem by editing existing human-written MWP sentences to change their theme without changing the underlying story (Koncel-Kedziorski et al., 2016a). For example, Koncel-Kedziorski et al. (2016a) proposed a rewriting algorithm to construct new texts by substituting thematically appropriate words and phrases. Rewriting based approaches are more flexible than template based approaches. However, there are several drawbacks that prevent them from providing a large number of MWPs. First, the generation process is based on existing MWPs, which significantly limits the generation ability. Second, students easily fall into rote memorization since it is trivial to notice that the underlying mathematical equations are unchanged.
Recent attempts have focused on exploiting neural network based approaches that generate MWPs from equations and topics in an end-to-end manner (Zhou and Huang, 2019; Liyanage and Ranathunga, 2020). Zhou and Huang (2019) designed a neural network with two encoders to fuse information of both equations and topics and a dual-attention mechanism to generate relevant MWPs. Liyanage and Ranathunga (2020) tackled the generation problem by using the long short-term memory network with enhanced input features, such as character embeddings, word embeddings and part-of-speech tag embeddings.
The closest work to our approach is Zhou and Huang (2019) and the main differences are as follows: (1) Zhou and Huang (2019) directly encode the equation by a single-layer bidirectional gated recurrent unit (GRU), while we first convert equations into a Levi graph and conduct the encoding by the GGNN model; (2) instead of directly using the pre-trained embeddings of similar words given the topic, we choose to learn the topic relevant representations from an external CSKG; and (3) we choose to use the VAE framework to promote more diverse results.

Learning from Commonsense Knowledge and Equations
Our objective is to automatically generate a significant number of diverse MWPs in students' real-life scenarios from valid equations. In addition, we support personalized generation in which students (or teachers) can determine the story plots of MWPs by specifying topics and mapping relations between variables and entities (i.e., "x: chicken, y: rabbits", "x: apple, y: banana", etc.). A topic indicates a type of real-world scenario, such as animals, fruits, etc. As shown in Figure 1, we adopt the encoder-decoder generation framework. The input includes a set of equations and a knowledge graph with a specific topic. We construct Levi graphs (Levi, 1942) from the symbolic equations and the CSKG respectively (see Section 3.1). After that, we employ GGNNs to extract the full graph structure information about equations and real-life story plots (see Section 3.2). Then, we generate the target sentence by a conditional VAE with a self-planning module (see Section 3.3). The self-planning module enables the decoder to pay different portions of attention to the equations and the CSKG.
Please note that in this paper, we focus on generating MWPs with linear equations of two variables without any constraint. Our framework can be generalized to MWPs with different numbers of variables with little modification.

Equation Based Symbolic Graph
The equation based symbolic graph is designed to capture the relations among mathematical variables and numerical quantities, and build connections between mathematical variables and the corresponding commonsense knowledge. In this work, we consider the linear equations (with two variables) behind the MWPs as ax + by = m; cx + dy = n, where x and y are the variables and a, b, c, d, m, and n are positive integer quantities. More equation variants are discussed in Appendix A.1.
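As a worked illustration of the two-variable form above, the following sketch solves ax + by = m; cx + dy = n exactly with Cramer's rule and checks whether the solution consists of positive integers. The function names `solve_2x2` and `is_answerable` are hypothetical, and the positive-integer criterion is an assumption about what makes a counting problem answerable; the experiments in this paper report using sympy's equation solver instead, so this is only a standard-library stand-in.

```python
from fractions import Fraction

def solve_2x2(a, b, m, c, d, n):
    """Solve ax + by = m, cx + dy = n exactly; return None if there is no unique solution."""
    det = a * d - b * c
    if det == 0:
        return None
    x = Fraction(m * d - b * n, det)
    y = Fraction(a * n - m * c, det)
    return x, y

def is_answerable(a, b, m, c, d, n):
    """Assumed criterion: a unique solution in positive integers."""
    sol = solve_2x2(a, b, m, c, d, n)
    if sol is None:
        return False
    return all(v.denominator == 1 and v > 0 for v in sol)

# Chickens (x) and rabbits (y): x + y = 27 heads, 2x + 4y = 86 legs.
solution = solve_2x2(1, 1, 27, 2, 4, 86)
```

For the chickens-and-rabbits system this yields x = 11 and y = 16, matching the solutions listed with the example problem in the Introduction.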
Equations are first converted to a symbolic graph as shown in Figure 2 (a). In the symbolic graph, edge labels such as Add to res and Mul represent the mathematical relations and play important roles in the MWP generation, where "Add to res" indicates an addition operation to the result operand and "Mul" indicates a multiplication operation. In order to capture such relations, we model the edge labels as explicit nodes. Following previous work in Beck et al. (2018), we transform the symbolic graph into its equivalent edge-enhanced Levi graph (Levi, 1942), in which each labeled edge is represented by two nodes: one represents the relation and one represents the reverse. By adding reverse nodes, we encourage more information flow from the reverse direction, in the same way that RNN-based encoders benefit from right-to-left propagation. Furthermore, we explicitly add self-loop edges to each node in the Levi graph. The symbolic Levi graph is depicted in Figure 2 (b). More details on Levi graph transformation can be found in Appendix A.2.
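The transformation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function name `to_levi`, the node naming scheme (`Mul#0`, `Mul_rev#0`), and the edge directions in the example triples are all hypothetical.

```python
def to_levi(triples, add_reverse=True, add_self_loops=True):
    """Turn labeled edges (head, relation, tail) into an unlabeled Levi graph.

    Every labeled edge becomes its own relation node, so the result has
    only plain (source, target) edges.
    """
    nodes = []
    edges = []

    def add_node(name):
        if name not in nodes:
            nodes.append(name)

    for i, (head, rel, tail) in enumerate(triples):
        add_node(head)
        add_node(tail)
        rel_node = f"{rel}#{i}"              # one node per labeled edge
        add_node(rel_node)
        edges.append((head, rel_node))
        edges.append((rel_node, tail))
        if add_reverse:
            rev_node = f"{rel}_rev#{i}"      # reverse node: information flows tail -> head
            add_node(rev_node)
            edges.append((tail, rev_node))
            edges.append((rev_node, head))

    if add_self_loops:
        edges.extend((n, n) for n in list(nodes))
    return nodes, edges

# Illustrative triples for ax + by = m (edge directions are assumed).
triples = [
    ("a", "Mul", "x"),
    ("b", "Mul", "y"),
    ("a", "Add to res", "m"),
    ("b", "Add to res", "m"),
]
nodes, edges = to_levi(triples)
```

With reverse nodes and self-loops disabled, the same function returns the plain Levi graph, which makes the effect of each enhancement easy to inspect.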

Commonsense Based Knowledge Graph
In order to generate valid questions in students' real-life scenarios, we utilize explicit knowledge from a self-derived CSKG specifically designed for MWP generation. Our CSKG is very small compared to publicly available knowledge resources like ConceptNet and Wikipedia. However, it contains exclusive relationships that can be utilized for MWP generation, i.e., (apple, has unit of measurement, pound), (banana, has price unit, yuan), (chicken, has feet number, 2), etc. These commonsense knowledge triples are extracted from MWP texts in a semi-automatic manner. Specifically, for each MWP, we first apply the part-of-speech tagger from Stanford CoreNLP with some heuristic rules for automatic commonsense knowledge extraction. Furthermore, because the generation process requires high-quality commonsense information, we ask crowd workers to verify the extracted results, which include both entities and the corresponding attributes. For example, for the MWP shown in Table 1, the auto-parsed entities are Chickens and Rabbits and the extracted relations are (1) a belong_to relation showing that the Chickens belong to livestock; and (2) a has_head_entity relation showing that the Chickens have head entity head. Similar relations about Rabbits are extracted as well. We form these triples into our commonsense knowledge graphs. Moreover, we explicitly select a few entities from the above extraction process as "topics" and these topic terms can be revised by the crowd workers if they are mis-extracted. The topic entities are obtained from a given K-12 educational vocabulary. Figure 2 (c) illustrates a sample of a CSKG with a topic of Livestock.
With the help of the CSKG, students or teachers are able to set their own preferences when generating MWPs by choosing different topics, such as zoo, transportation, etc. This external commonsense knowledge provides additional background information that improves the diversity of the generated results. Moreover, the CSKG improves the generation quality by alleviating ill-informed wordings or sentences. For instance, although they contain no grammatical errors, sentences such as "rabbits live in the ocean" or "apple has two feet" make no sense. Similar to the Levi graph construction procedure in Section 3.1.1, we introduce additional nodes for relations in the CSKG and add reverse and self-loop edges. The CSKG Levi graph is shown in Figure 2 (d).
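To make the triple format concrete, the sketch below stores the three example triples quoted in the text and looks up which attributes an entity has. The helpers `attributes_of` and `is_supported` are hypothetical and only illustrate how explicit triples could rule out a statement such as "apple has two feet"; in the model itself the CSKG is consumed through learned graph embeddings rather than through a lookup like this.

```python
# Triples taken from the examples in the text; the lookup helpers are hypothetical.
CSKG = {
    ("apple", "has unit of measurement", "pound"),
    ("banana", "has price unit", "yuan"),
    ("chicken", "has feet number", "2"),
}

def attributes_of(entity, graph=CSKG):
    """Collect {relation: value} for every triple whose head is the entity."""
    return {rel: val for head, rel, val in graph if head == entity}

def is_supported(entity, relation, graph=CSKG):
    """True if the graph records this relation for the entity."""
    return relation in attributes_of(entity, graph)

chicken_has_feet = is_supported("chicken", "has feet number")
apple_has_feet = is_supported("apple", "has feet number")
```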

Gated Graph Neural Encoding
Following the success of GGNN models (Beck et al., 2018; Ruiz et al., 2019), we use GGNNs to capture both the mathematical relations among variables and quantities and the real-life associations among entities in the MWPs. Specifically, let $G = \{V, E\}$ be an edge-enhanced Levi graph where $V$ and $E$ are the sets of nodes and edges. Let $a_{v,u}$ be the similarity between node $v$ and node $u$ taken from the row-wise normalized adjacency matrix. Given an input Levi graph $G$ that may represent either the equations or the CSKG, the basic recurrence of the GGNN model updates each node embedding from its neighbors through a gated unit, where $\mathcal{N}_v$ is the set of neighbor nodes for $v$, $\sigma$ is the sigmoid function, $\odot$ is the component-wise multiplication function, and $z_t^v$ and $r_t^v$ are gating vectors.
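For reference, a commonly used form of the GGNN recurrence that is consistent with the symbols above is sketched below. The weight matrices $W_*$ and $U_*$, the aggregated message $m_t^v$, and the candidate state $\tilde{g}_t^v$ are assumed names introduced for this sketch, and the exact parameterization used in the model may differ.

```latex
\begin{aligned}
m_t^v &= \sum_{u \in \mathcal{N}_v} a_{v,u}\, g_{t-1}^u \\
z_t^v &= \sigma\left(W_z m_t^v + U_z g_{t-1}^v\right) \\
r_t^v &= \sigma\left(W_r m_t^v + U_r g_{t-1}^v\right) \\
\tilde{g}_t^v &= \tanh\left(W_g m_t^v + U_g \left(r_t^v \odot g_{t-1}^v\right)\right) \\
g_t^v &= \left(1 - z_t^v\right) \odot g_{t-1}^v + z_t^v \odot \tilde{g}_t^v
\end{aligned}
```

Here the update gate $z_t^v$ controls how much of the previous node state is kept, and the reset gate $r_t^v$ controls how much of it enters the candidate state.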
Let $G_0 = [g_0^1; g_0^2; \cdots; g_0^{|V|}]$ be the initial word embedding matrix of all the nodes and $G_n$ be the matrix of node embeddings from the above GGNN model after $n$ iterations, i.e., $G_n = [g_n^1; g_n^2; \cdots; g_n^{|V|}]$. Similar to He et al. (2016), we ease the downstream learning tasks with embedding augmentation: we apply a linear transformation to the concatenation of $G_0$ and $G_n$ to obtain the augmented representation $G^*$. Such augmented node representations contain abstract context information and are used in our language generator in Section 3.3. Let $G_e^*$ and $G_k^*$ be the augmented GGNN embeddings of the equations and the CSKG, respectively. We then apply a mean pooling operation over $G_e^*$ and $G_k^*$ to get the graph-level equation representation ($g_e^*$) and CSKG representation ($g_k^*$).
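The augmentation and pooling steps can be illustrated with plain Python lists. This is a minimal sketch: the function names `augment` and `mean_pool`, the bias term, and the toy weight matrix are assumptions made for illustration, not the trained parameters of the model.

```python
def augment(g0, gn, weight, bias):
    """Apply a linear map to the concatenation [g0; gn] of every node."""
    out = []
    for v0, vn in zip(g0, gn):
        concat = v0 + vn                      # list concatenation of the two embeddings
        out.append([sum(w * x for w, x in zip(row, concat)) + b
                    for row, b in zip(weight, bias)])
    return out

def mean_pool(node_embeddings):
    """Average node embeddings column-wise to get one graph-level vector."""
    n = len(node_embeddings)
    return [sum(col) / n for col in zip(*node_embeddings)]

# Two nodes with 2-dimensional embeddings before and after propagation.
g0 = [[1.0, 0.0], [0.0, 1.0]]
gn = [[2.0, 2.0], [4.0, 0.0]]
weight = [[1.0, 0.0, 1.0, 0.0],               # toy map that adds g0 and gn
          [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]

augmented = augment(g0, gn, weight, bias)
graph_vector = mean_pool(augmented)
```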

Conditional VAE with Self-Planning
In this section, we introduce our VAE architecture with the self-planning module for the MWP generation. Our self-planning module makes dynamic fusion on the learned representations of equations and CSKG to generate the MWPs.
Let $Y$ be the random variable representing the texts of MWPs and $Z$ be the diversity-promoting latent variable of the distribution of the MWPs. Let $C$ be the random variable representing the conditions of both the explicit equations and the implicit CSKG learned from GGNNs. We model the MWP generation by the conditional distribution
$$p(Y|C) = \int p(Y|Z, C)\, p(Z|C)\, dZ,$$
where $p(Y|Z, C)$ is the MWP generator and $p(Z|C)$ is the prior net. Since the integration over $Z$ is intractable, we apply variational inference and optimize the evidence lower bound
$$\log p(Y|C) \geq \mathbb{E}_{q(Z|C, Y)}\left[\log p(Y|Z, C)\right] - D_{KL}\left(q(Z|C, Y)\,\|\,p(Z|C)\right),$$
where $q(Z|C, Y)$ is the variational posterior and $D_{KL}(\cdot\|\cdot)$ denotes the KL-divergence.
Following conventions, we assume that both the prior net and the posterior net of $Z$ follow isotropic Gaussian distributions, i.e., $p(Z|C) \sim \mathcal{N}(\mu_p, \sigma_p I)$ and $q(Z|C, Y) \sim \mathcal{N}(\mu_q, \sigma_q I)$. The prior net only encodes the given conditions of both the explicit equations and the implicit CSKG, while the posterior net encodes both the given conditions and the texts of MWPs. Both the prior net and the posterior net are built upon the GGNNs shown in Figure 1.
Due to the flexibility of language, there may exist more than one reasonable expression that covers the same input but in a different order. For example, "Chickens and rabbits were in the yard. Together they had 27 heads and 86 legs." can be rewritten as "Teacher finds 27 heads and 86 legs in the yard, in which there are only chickens and rabbits.". The former expression can be viewed as the plan of first generating commonsense sentences and then the symbolic sentences, while the latter first generates symbolic sentences and then commonsense sentences. We capture such diversity of reasonable presentations with both the latent variable $Z$ and the input graphs $C$. Different samples of $Z$ will lead to different self-planning results. To start the decoding process, we initialize the hidden state with a sample $z$, where $z$ is drawn from the posterior net $q(Z|C, Y) \sim \mathcal{N}(\mu_q, \sigma_q I)$ during training and from the prior net $p(Z|C) \sim \mathcal{N}(\mu_p, \sigma_p I)$ during inference.
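The two Gaussian operations involved here, sampling $z$ and measuring the divergence between posterior and prior, can be sketched as follows. The sketch assumes that the scalar $\sigma$ is a standard deviation shared across dimensions and uses the standard reparameterization $z = \mu + \sigma \epsilon$; the function names `sample_latent` and `kl_isotropic` are hypothetical.

```python
import math
import random

def sample_latent(mu, sigma, rng=random):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

def kl_isotropic(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in closed form."""
    d = len(mu_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return (d * math.log(sigma_p / sigma_q)
            + (d * sigma_q ** 2 + sq_dist) / (2 * sigma_p ** 2)
            - d / 2)

# Training would sample from the posterior parameters, inference from the prior.
z_train = sample_latent([0.1, -0.2, 0.3], 0.5, random.Random(0))
gap = kl_isotropic([0.0, 0.0], 1.0, [1.0, 0.0], 1.0)
```

When the posterior and prior parameters coincide the divergence is zero, which is the situation the KL term of the objective pushes towards.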
At each decoding time step $t$, we dynamically decide the portions of input information from the equations and the CSKG respectively based on the current hidden state $h_t$, which keeps track of the current generating state. We use the attention mechanism to conduct the self-planning between the explicit symbolic equations and the implicit CSKG. The dynamic self-planning module takes the decoder's current hidden state ($h_t$) and the node representations of the equations ($G_e^*$) and the CSKG ($G_k^*$) as input, and outputs the context-aware planning state ($c_t$) of the current time step, where $\beta_t$ represents the self-planning distribution at time step $t$. The final context vector is the fusion of the symbolic and commonsense knowledge graphs. The next-step hidden state ($h_{t+1}$) is a combination of the current hidden state ($h_t$), the self-planning context state ($c_t$), and an additional input representation, computed with a linear transformation matrix $W_d$ and a bias term $b_d$. We then generate the next word by feeding the hidden state $h_{t+1}$ to a linear transformation and a softmax layer to get the next-token probability distribution.
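One plausible form of this dynamic fusion is sketched below. It is an illustration under explicit assumptions rather than the model's actual equations: it uses dot-product attention over each graph's nodes and a two-way softmax as the planning distribution $\beta_t$, and the function names `attend` and `self_plan` are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(h, nodes):
    """Dot-product attention of hidden state h over a list of node vectors."""
    weights = softmax([dot(h, n) for n in nodes])
    dim = len(nodes[0])
    return [sum(w * n[i] for w, n in zip(weights, nodes)) for i in range(dim)]

def self_plan(h, eq_nodes, kg_nodes):
    """Return (beta, c): the two-way planning distribution and the fused context."""
    c_eq = attend(h, eq_nodes)                 # context from the equation graph
    c_kg = attend(h, kg_nodes)                 # context from the CSKG
    beta = softmax([dot(h, c_eq), dot(h, c_kg)])
    c = [beta[0] * e + beta[1] * k for e, k in zip(c_eq, c_kg)]
    return beta, c

h_t = [1.0, 0.0]
eq_nodes = [[1.0, 0.0], [0.0, 1.0]]
kg_nodes = [[0.0, 1.0], [0.0, 1.0]]
beta_t, c_t = self_plan(h_t, eq_nodes, kg_nodes)
```

In this toy case the hidden state aligns with the first equation node, so the planning distribution places more weight on the equation graph than on the CSKG.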
The final objective function consists of (1) maximizing the probability of the ground-truth sequence texts, which pushes the predictions generated by the posterior net and the MWP generator closer to the distribution of the gold-standard data; and (2) minimizing the KL-divergence between the posterior distribution ($q(Z|C, Y)$) and the prior distribution ($p(Z|C)$).

Experiments
In this work, we crawled 5,447 MWPs of linear equations from a third-party website, and each MWP consists of two unknown variables and two equations. The dataset covers 119 topics and the average length of an MWP is 62 words. In one CSKG, the average number of entities is 17.067 and the average number of edges is 29.102. We randomly select 544 of the MWPs as our validation set, and 546 of them as our gold-standard test (GT) set. Please note that, different from previous work on automatically solving MWPs such as MAWPS (Koncel-Kedziorski et al., 2016b) and MathQA (Amini et al., 2019), we focus on the generation task of MWPs with more than one linear equation in students' real-life scenarios by using topics in our CSKG. Neither MAWPS nor MathQA contains MWPs that have two or three equations and variables. Furthermore, there are no explicit topics associated with the MWPs in these publicly available MWP datasets.
Meanwhile, we conduct two human evaluation studies to comprehensively evaluate the quality of the generated MWPs. First, we ask three evaluators to rate each MWP on the following aspects on a scale from 1 to 3: (1) Equation Relevance: how relevant is the MWP with respect to the input equations? (2) Topic Relevance: how relevant is the MWP with respect to the given topic? and (3) Language Coherence: is the MWP coherent and well-organized? We use the average scores from the three human evaluators as our final results.
Before training, in order to ensure that each question is answerable, we first use sympy's equation solver to solve all the algebraic equations.
We compare our MaKE against several strong baselines, including: (1) the template based method, i.e., Template; and (2) a conditional VAE that captures the diversity in the encoder and uses latent variables to learn a distribution over potential intents, i.e., CVAE.
Please note that we do not select rewriting based approaches as baselines in this work. This is because rewriting based approaches require a very large pre-stored question bank and only work when the input equations are matched in the question bank.

Results and Analysis
Evaluation Results on GT Set. Results on the GT set are listed in Table 2, which shows that our MaKE outperforms all baseline methods in terms of both automatic and human evaluation metrics. Specifically, from Table 2, we find: (1) comparing MaKE and Template, Template does not perform well in language coherence and topic relevance. This is because the MWP templates are stereotyped, and mismatches between the template context and the re-filled words lead to incoherent texts; and (2) comparing MaKE and the seq2seq baselines, with rich representations of equations and CSKG, MaKE is able to better capture mathematical relations and improve MWP quality with real-life plots under the given topic.
Turing Test Results on GT Set. For each existing MWP in the GT set, we generate a new MWP of the same equations but with a different topic. We show such pairs to the human evaluators and ask them to identify which one is the generated MWP. We measure the results of this artificial "Turing Test" via the Fool Ratio, i.e., the fraction of instances in which a model is capable of fooling the evaluators. Ideally, perfect MWP generation will lead to random guesses and the ideal Fool Ratio would be 50%. We obtain an average Fool Ratio of 39.38% (36.08%, 42.49% and 39.56% from the three annotators respectively). By this measure, the generation quality reaches 78.76% (39.38/50) of the quality of MWPs written by human teachers.

UniLM: There are small boats and big boats in the competition. There are 6 sitting in big boats, 8 big boats. On the scene, one is more than the big boat with 8 people, there are sitting in small boat, 64 people for total.
Transformer: In order to reward the students who did well in this test, Teacher Fang decided to take 4 and 6 with a total of 64 people to go boating on the weekend! A small boat can seat 4 people, and a big boat can seat 8 people. Teacher Fang rents 6 more small boats than the big boats. How many small boats does Teacher Fang rent?
MaKE: A company has two types of boats, and the number of big boats is 6 more than that of small boats. Each small boat can accommodate 4 people and each big boat can accommodate 8 people. When all boats are filled up, the number of people in the small boats is 64 less than that in the big boats. How many big boats are there?

Ablation Study. For MaKE w/o symbolic, we only input the CSKG to the model, remove the equation-based symbolic graph, and leave the other components unchanged. For MaKE w/o CSKG, we retain the symbolic graph in the input and discard the CSKG. For MaKE w/o planning, the input remains unchanged, but the decoder becomes a normal GRU and the attention score is computed for all nodes in the symbolic graph and the CSKG simultaneously. Table 2 shows the results of the ablation study. Without the self-planning module, we observe that the model's Self-BLEU performance decreases, which empirically supports our assumption that the self-planning module can capture the flexibility in the language. Meanwhile, the performance of our model drops by 1.71% in BLEU-4, 0.75% in METEOR, 0.02% in ROUGE-L and 2.259% in Self-BLEU, which also supports the effectiveness of the self-planning module. The MaKE w/o CSKG approach achieves the best Self-BLEU score but the worst human evaluation scores, which indicates that the representations of the CSKG help form valid MWPs in real-life scenarios. This is because we utilize the CSKG as a commonsense constraint on the generated MWPs, which limits the number of words that can be generated by the MaKE method under that condition. When we remove this constraint in MaKE w/o CSKG, the model only needs to satisfy the symbolic equation conditions, regardless of what the topic is or which entity the unknown variable corresponds to. Hence the search space for words becomes larger, which directly increases the Self-BLEU score. However, the drawback is that the generated texts may violate commonsense knowledge.
The MaKE w/o symbolic shows a significant decrease on all the automatic evaluation metrics except for the topic relevance score, which is reasonable since understanding the mathematical variables, numerical quantities, operations, and their relations is essential in generating logically coherent MWPs.
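The Fool Ratio arithmetic reported for the Turing test can be reproduced with a few lines. This is only a sketch of the calculation: the helper `fool_ratio` and its boolean input format are hypothetical, and the per-annotator percentages are the ones quoted in the text.

```python
def fool_ratio(judgements):
    """Fraction of pairs in which the evaluator failed to spot the generated MWP."""
    return sum(1 for fooled in judgements if fooled) / len(judgements)

annotator_ratios = [36.08, 42.49, 39.56]          # percentages reported in the text
average = sum(annotator_ratios) / len(annotator_ratios)
relative_quality = round(average, 2) / 50.0       # 50% corresponds to random guessing
```

The average rounds to 39.38%, and dividing by the 50% ceiling gives the 78.76% figure.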
Equations: x=y; 2x+4y=48; Topic: Livestock; Entities: x: Chicken; y: Rabbit
1. Rabbits and chicken are in one cage. The number of rabbits is 0 less than that of chickens. They have 48 legs in total. How many rabbits and chickens in cage?
2. There are the same number of chickens and rabbits in the yard, and the total number of legs is 48. How many rabbits and chickens are in the yard?
3. Chicken and rabbits are in the same cage. Xiaoming counted the number of heads of the two animals and found that the number of chicken heads was 0 more than the number of rabbit heads. There are 48 legs in total. May I ask how many chickens are there?
4. Xiaojun is very good at math, but today there is a difficult problem for him: A chicken has 1 head and 2 legs, and a rabbit has 1 head and 4 legs. There are chickens and rabbits in the same cage, and the number of chickens is equal to the number of rabbits. There are 48 feet in total, so how many chickens and how many rabbits are there?

Qualitative Case Study. Because of the GGNN encodings of equations, our MaKE model is able to handle a wide range of mathematical relations, including both addition and subtraction, i.e., a, b, m, c, d and n may be either positive or negative in ax + by = m; cx + dy = n. We quantitatively compare the generation quality of MaKE with other baselines and the results are shown in Table 3. Furthermore, we show the diverse results of MaKE qualitatively in Table 4. Additional examples can be found in Appendices A.5 and A.6. As we can see, (1) CVAE and Transformer cannot interpret the equations correctly and fail to generate the desired MWPs; and (2) our MaKE approach is able to generate sufficiently diverse MWPs in real-life scenarios.
Large-scale Human Evaluation Results. Besides evaluations on the GT set, which is usually limited in size in educational scenarios (Xu et al., 2019; Wang et al., 2020), we conduct evaluations on large-scale generated results. We randomly create 100 valid linear equations and ensure that none of them appears in our training set. Meanwhile, we select the top 30 common real-life topics. For each pair of equation and topic, we generate 5 MWPs and therefore obtain 15,000 MWPs. We conduct a human evaluation to assess the quality of these generated MWPs and the results are shown in Table 5. We can see that our method outperforms baseline models by a large margin.
Error Analysis. To better understand the limitations of our approach, we manually review 150 equations and the corresponding generated MWPs. The two major problems are missing information and language disfluency. We show two representative examples in Table 6. In the example of missing information, the information that the small boat can accommodate 2 people and the big boat can accommodate 4 people is ignored, because some MWPs in the training set often omit this kind of "pre-learned" knowledge, such as the fact that chickens have two legs. The language disfluency problem is caused by the limited size of the training data under certain specific topics. This can be alleviated or addressed by either collecting more MWP training data or providing more information in the CSKG to explicitly control the context of the generated text, such as the fact that livestock often live on farms and marine animals are found in the ocean.
Missing information
Equations: x-y=6; 2x-4y=10; Topic: Rowing boat; Entities: x: Small boat; y: Big boat
Teacher Mr.Huang and his 35 students come to row the boat. They find 6 more small boats than big boats. There are 10 more students in the small boat than in the big boat. How many big boats are there?
Language disfluency
Equations: x-y=1; 6x-8y=0; Topic: Insects; Entities: x: Cockroaches; y: Ants
There are two types of heads: cockroaches and ants. Cockroaches have 1 more head than ants, and cockroaches have 0 more than ant legs. How many cockroaches and ants respectively?
Table 6: Illustrative examples that demonstrate the typical problems of the current system.

Conclusion
In this paper, we presented a neural encoding-decoding architecture for MWP generation. Compared with the existing NQG algorithms, the advantages of our MaKE are: (1) it extracts intrinsic representations of both the equation based symbolic graph and the CSKG; (2) it automatically selects and incorporates information from equations and knowledge graphs during the decoding process; and (3) it is able to generate relevant, coherent and diverse MWPs in students' real-life scenarios. Experimental results on real-world educational MWP data sets demonstrate that MaKE outperforms other SOTA NQG approaches in terms of both automatic evaluation metrics and human evaluation metrics.
In the future, we plan to explore MWP generation for more mathematical variables with high-order operations, and to explore methods for incorporating commonsense knowledge from publicly available resources like ConceptNet or Wikipedia.

A.1 Equation Variants
In our scenario, the general expression formula can be written as follows:

$$\eta_0 \phi_0 x \; \eta_1 \phi_1 y = \phi_2 \; \eta_2 \; \phi_3 \tag{1}$$
$$\eta_3 \phi_4 x \; \eta_4 \phi_5 y = \phi_6 \; \eta_5 \; \phi_7 \tag{2}$$

where $\eta_* \in \{+, -, \times, \div\}$ are the operators in the equations and $\phi_*$ are numbers. Note that at most one of the operators $\eta_0$ and $\eta_1$ (respectively $\eta_3$ and $\eta_4$) may be a minus operator. The equation variants are derived from different configurations of the operators $\eta_*$. We discuss all possible variants of eq. (1) in detail; eq. (2) behaves in the same way, and different combinations of eq. (1) and eq. (2) lead to different systems of linear equations in two unknowns. Depending on the presence or absence of the operator $\eta_2$ and the number $\phi_3$, there are two possible symbolic graph structures, shown in Figure 3. In Figure 3 (a), $\phi_2$ is equal to "m", and $\eta_2$ and $\phi_3$ are empty. We therefore connect node a and node m with the relation "Minuend to res" and connect node b and node m with the relation "Subtrahend to res", indicating that ax and by are the minuend and subtrahend in "ax-by=m", respectively. In Figure 3 (b), $\phi_2$, $\eta_2$ and $\phi_3$ are equal to "c", "+" and "d", respectively. To stay consistent with the graph structure in Figure 3 (a), we first add a dummy node dum to the symbolic graph, and then connect node c and node dum, as well as node d and node dum, with the relation "Add to dummy". In this way, the dummy node represents the expression "c+d".
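The two right-hand-side cases of Figure 3 can be sketched as a small edge-list builder. This is only an illustration, not the released implementation: the function name `rhs_edges` and the node name `dum` are hypothetical, and attaching the coefficient nodes to the dummy node with the same relations as in case (a) is our assumption based on the stated goal of keeping both structures consistent.

```python
from typing import List, Optional, Tuple

# (source node, relation label, target node)
Edge = Tuple[str, str, str]


def rhs_edges(a: str, b: str, phi2: str, phi3: Optional[str] = None) -> List[Edge]:
    """Edges linking the coefficients of 'ax - by' to the right-hand side."""
    edges: List[Edge] = []
    if phi3 is None:
        # Figure 3 (a): the right-hand side is the single number phi2 (= m).
        res = phi2
    else:
        # Figure 3 (b): the right-hand side 'c + d' is summarised by a dummy node.
        res = "dum"
        edges.append((phi2, "Add to dummy", res))
        edges.append((phi3, "Add to dummy", res))
    # Assumption: in case (b) the coefficient nodes attach to the dummy node
    # with the same relations that are used for node m in case (a).
    edges.append((a, "Minuend to res", res))
    edges.append((b, "Subtrahend to res", res))
    return edges


if __name__ == "__main__":
    print(rhs_edges("a", "b", "m"))       # structure for ax - by = m
    print(rhs_edges("a", "b", "c", "d"))  # structure for ax - by = c + d
```

The sketch covers only the relations named in this section; the edges between coefficients and variables are omitted.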

A.2 Levi Graph Transformation
Let $G = \{V, E, R\}$ be a directed symbolic graph with nodes $v_i \in V$ and labeled edges $(v_i, r, v_j) \in E$, where $r \in R$ is a relation type (e.g., Add to res, Mul), as shown in Figure 2 (a). Let $|V|$ and $|E|$ denote the number of nodes and edges, respectively. We convert the graph $G$ into an unlabeled, directed bipartite graph $G_t = \{V_t, E_t\}$ with the Levi transformation, which converts each labeled edge $(v_i, r, v_j) \in E$ into two unlabeled edges $(v_i, r), (r, v_j) \in E_t$, so that $|V_t| = |V| + |E|$. Intuitively, transforming a graph into its Levi graph turns the original edges into additional nodes, which allows us to encode edge label information directly with word embeddings and guarantees that relation messages are passed during multi-hop reasoning.
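A minimal sketch of this transformation is given below. It assumes the graph is stored as a node set plus a list of labeled edge triples; the function name `levi_transform` and the `label#index` naming of the new relation nodes are hypothetical choices made so that every original edge yields its own node, in line with $|V_t| = |V| + |E|$.

```python
from typing import List, Set, Tuple

LabeledEdge = Tuple[str, str, str]  # (v_i, r, v_j)
PlainEdge = Tuple[str, str]         # unlabeled directed edge


def levi_transform(
    nodes: Set[str], edges: List[LabeledEdge]
) -> Tuple[List[str], List[PlainEdge]]:
    """Turn every labeled edge into a relation node plus two unlabeled edges."""
    levi_nodes = sorted(nodes)
    levi_edges: List[PlainEdge] = []
    for idx, (src, rel, dst) in enumerate(edges):
        # One fresh node per original edge; the suffix keeps nodes distinct
        # even when two edges share the same relation label.
        rel_node = f"{rel}#{idx}"
        levi_nodes.append(rel_node)
        levi_edges.append((src, rel_node))
        levi_edges.append((rel_node, dst))
    return levi_nodes, levi_edges


if __name__ == "__main__":
    v = {"a", "b", "m"}
    e = [("a", "Minuend to res", "m"), ("b", "Subtrahend to res", "m")]
    t_nodes, t_edges = levi_transform(v, e)
    print(len(t_nodes), len(t_edges))
```

Under this representation the relation label can be recovered from the node name (the part before `#`) when looking up its word embedding.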

A.3 Training and Testing
We obtain non-lexical (delexicalized) text by replacing the numbers in the question text with the pre-defined special tokens in our symbolic equation graph and CSKG. The procedures are similar to the example in Table 8, with the following differences:
• Matching words for unknown variables are first extracted from the gold MWP; we then query our private database with the given topic word to construct the commonsense knowledge graph.
• MaKE transforms the operators η * into equation graph edge labels (relations) and the numbers ϕ * into equation graph nodes v.
• More words in the MWP text are replaced with special tokens from the CSKG. Taking the sentence in Table 8 as an example, "wheels" is replaced by one node (a counting entity) in the corresponding commonsense graph.
During training, the input to MaKE is the CSKG and the equation-based symbolic graph, and the output target is the delexicalized word sequence. We apply the same word-refilling post-processing procedure to obtain the final MWP.
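The delexicalization and refilling steps can be illustrated with the following sketch. It is a simplified stand-in rather than the actual pipeline: the token format `<phi_k>`, the function names, and the rule that each number occurrence consumes the first unused slot with the same surface form are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def delexicalize(text: str, slots: List[Tuple[str, str]]) -> str:
    """Replace numbers in `text` with special tokens.

    `slots` lists (surface number, token) pairs; scanning left to right, each
    number occurrence takes the first unused slot with the same surface form.
    """
    remaining = list(slots)

    def repl(match: "re.Match[str]") -> str:
        for i, (surface, token) in enumerate(remaining):
            if surface == match.group(0):
                del remaining[i]
                return token
        return match.group(0)  # number without a slot stays unchanged

    return re.sub(r"\d+(?:\.\d+)?", repl, text)


def refill(text: str, values: Dict[str, str]) -> str:
    """Inverse step: put the saved words and numbers back into the text."""
    if not values:
        return text
    pattern = re.compile("|".join(re.escape(tok) for tok in values))
    return pattern.sub(lambda m: values[m.group(0)], text)


if __name__ == "__main__":
    question = "There are 6 vehicles. One kind has 2 wheels, the other has 4, a difference of 6."
    slots = [("6", "<phi_2>"), ("2", "<phi_4>"), ("4", "<phi_5>"), ("6", "<phi_6>")]
    masked = delexicalize(question, slots)
    print(masked)
    print(refill(masked, {tok: num for num, tok in slots}))
```

The same `refill` call can also restore entity placeholders such as `x_entity` if they are added to the `values` dictionary.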

A.4 Baseline Methods Details
Template In addition to neural baselines, we use a problem-specific, template-based generator. Given the input topic words, the template-based method first finds MWPs in the question bank that have the same type of input equations. For instance, suppose the query equation is x + y = 6; 2x − 4y = 6 and the query topic is vehicle. As shown in Table 7, we first delexicalize the input equation pairs with special tokens and save the replaced values for post-processing. After querying our question bank with the delexicalized equation pairs and the topic word, we obtain the pre-stored MWP template and the matching words for the unknown variables. Finally, we fill the MWP template with the previously saved values and obtain the generated MWP.

Table 7
Equation templates: x+y=α; bx-cy=d
Topic: vehicle
Query text: There are α x_entity and y_entity in the parking lot. Each x_entity has b wheels and each y_entity has c wheels. x_entity has d more total wheels than y_entity. How many x_entity are there?
Query variables: x_entity: motorcycles, y_entity: cars
Generated MWP: There are 6 motorcycles and cars in the parking lot. Each motorcycles has 2 wheels and each cars has 4 wheels. Motorcycles has 6 more total wheels than cars. How many motorcycles are there?

General equation form: $\eta_0 \phi_0 x \; \eta_1 \phi_1 y = \phi_2 \; \eta_2 \; \phi_3$; $\eta_3 \phi_4 x \; \eta_4 \phi_5 y = \phi_6 \; \eta_5 \; \phi_7$
Input sequence expression: [$\eta_0$, $\phi_0$, $\eta_1$, $\phi_1$, $\phi_2$, $\eta_2$, $\phi_3$, $\eta_3$, $\phi_4$, $\eta_4$, $\phi_5$, $\phi_6$, $\eta_5$, $\phi_7$, Topic]
Input sequence for given example: [pad, $\phi_0$, +, $\phi_1$, $\phi_2$, pad, pad, pad, $\phi_4$, -, $\phi_5$, $\phi_6$, pad, pad, x_entity, y_entity, vehicle]
Output MWP: There are $\phi_2$ x_entity and y_entity in the parking lot. Each x_entity has $\phi_4$ wheels and each y_entity has $\phi_5$ wheels. x_entity has $\phi_6$ more total wheels than y_entity. How many x_entity are there?
Table 8: Input sequence for the seq2seq methods. $\phi_*$ are the numbers in the equations and $\eta_*$ are the operators. If there is no valid operator or number for a given special token, we fill it with a pad token.
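The template-based baseline can be pictured with the short sketch below, which reuses the vehicle example from Table 7. The in-memory `QUESTION_BANK` dictionary, its key format, and the function name `fill_template` are hypothetical; the real question bank and its query interface are not described here. The sentence-initial capitalization step is likewise our assumption.

```python
import re
from typing import Dict

# Hypothetical question bank keyed by (delexicalized equations, topic).
QUESTION_BANK = {
    ("x+y=α; bx-cy=d", "vehicle"): {
        "template": (
            "There are α x_entity and y_entity in the parking lot. "
            "Each x_entity has b wheels and each y_entity has c wheels. "
            "x_entity has d more total wheels than y_entity. "
            "How many x_entity are there?"
        ),
        "entities": {"x_entity": "motorcycles", "y_entity": "cars"},
    }
}


def fill_template(equation_key: str, topic: str, numbers: Dict[str, str]) -> str:
    """Look up a stored template and fill in the entities and saved numbers."""
    entry = QUESTION_BANK[(equation_key, topic)]
    slots = {**entry["entities"], **numbers}
    # Match placeholders only as whole words so that, e.g., 'b' is not
    # replaced inside ordinary words.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, slots)) + r")\b")
    text = pattern.sub(lambda m: slots[m.group(1)], entry["template"])
    # Capitalize a lowercase letter that starts a sentence after filling.
    return re.sub(
        r"(^|[.?!]\s+)([a-z])", lambda m: m.group(1) + m.group(2).upper(), text
    )


if __name__ == "__main__":
    saved = {"α": "6", "b": "2", "c": "4", "d": "6"}
    print(fill_template("x+y=α; bx-cy=d", "vehicle", saved))
```

Because the template is filled verbatim, agreement errors such as "Each motorcycles has" carry over into the output, as they do in the example in Table 7.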
CVAE (Zhao et al., 2017) Similar to previous work, we apply a seq2seq model and adopt a latent variable to capture the diversity of MWPs. We replace the hierarchical encoder with a one-layer GRU, and the initial state of the decoder is the combination of a latent variable and the final state of the encoder. As shown in Table 8, we apply the delexicalization process and sequence transformation operations for all the training data. The input sequence includes special tokens, operators and the topic word. After refilling the special tokens with corresponding matching words, we obtain the final MWP.
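The sequence transformation of Table 8 can be sketched as follows. The ASCII token names `phi_k`, the `pad` string, and both function names are hypothetical; the slot order follows the input sequence expression in Table 8, with a pad token wherever an operator or number is absent.

```python
from typing import List, Optional, Tuple

PAD = "pad"

# (leading operator, operator between the x and y terms, right-hand-side operator);
# None means that the operator is absent from the equation.
Ops = Tuple[Optional[str], Optional[str], Optional[str]]


def encode_equation(ops: Ops, first_index: int) -> List[str]:
    """Seven slots for one equation: op, num, op, num, num, op, num."""
    lead, mid, rhs_op = ops
    k = first_index
    return [
        lead or PAD,
        f"phi_{k}",
        mid or PAD,
        f"phi_{k + 1}",
        f"phi_{k + 2}",
        rhs_op or PAD,
        # The second right-hand-side number exists only with an operator.
        f"phi_{k + 3}" if rhs_op else PAD,
    ]


def build_input_sequence(eq1: Ops, eq2: Ops, entities: List[str], topic: str) -> List[str]:
    """Concatenate both encoded equations, the entity tokens and the topic."""
    return encode_equation(eq1, 0) + encode_equation(eq2, 4) + entities + [topic]


if __name__ == "__main__":
    # x + y = 6; 2x - 4y = 6 with topic "vehicle"
    seq = build_input_sequence(
        (None, "+", None), (None, "-", None), ["x_entity", "y_entity"], "vehicle"
    )
    print(seq)
```

For the running example this yields the same padded layout as the "Input sequence for given example" row of Table 8.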
MAGNET (Zhou and Huang, 2019) MAGNET is a previously proposed seq2seq MWP generation framework. Following the original implementation, we use a bidirectional RNN to encode the equation sequence and encode the topic word with a word representation lookup table. The decoder is a unidirectional RNN with an equation-topic fusion mechanism that leverages both equation and topic information. We follow the same input sequence as described in Table 8, but split the equation sequence and the topic words into separate inputs.
UniLM (Dong et al., 2019) A pre-trained natural language generation model with transformer encoder and decoder blocks. We fine-tune UniLM on the MWP generation task with the same input and output token sequences as described for the CVAE method.
Transformer (Vaswani et al., 2017) We include a Transformer-based seq2seq model, which has proved successful in machine translation tasks. It uses the same input sequence as the previous methods.

A.5 Additional MWP Generation Comparison
We provide additional illustrative examples of the MWP generation comparison with unseen equations in Table 9.

A.6 Additional Diverse MWP Results
We provide additional diverse MWP results in Table 10.

CVAE
There are many small cars and big cars in the parking lot. There are 6 people in these cars and 56 people in big cars. How many small cars and big cars are there?

MAGNET
We need to clean up a total of large cars and small cars. We know that small car can transport 6, and the number of large cars is the number of times. Please tell the number of small cars?

UniLM
Doctors have produced a lot of small cars and large cars. There are 0 cars in total. These two types of cars have 56 people. How many of these two cars have a total of 56 people, small cars?

Transformer
There are 56 people in the class to go rowing, and there are 0 cars in total. Among them, the small cars have 4 people each, and the big cars have 6 people each. The small cars have (), and the big cars have ().

MaKE
A company needs to transport 56 wheels in one transportation task. We know that each large car can carry 6 wheels and each small car can carry 4 wheels. The company has completed the task with only one time transportation, and the number of used small car is twice the number of large car. Q: They used () small car.

CVAE
Chickens and rabbits in the same cage. The number of chickens is twice the number of chickens. The number of rabbits and rabbits is twice the number of chickens, and the total number of legs is 10. How many chickens and rabbits are there each?

MAGNET
Chickens and rabbits in the same cage. The number of chickens is 20 more than that, and there are 10 rabbit legs. How many rabbits are there?

UniLM
How many chickens and rabbits are there in the 10-legged training centre?

Transformer
There were 20 chickens and rabbits. The total number of legs of the chickens was 10 more than the rabbits, and the chickens had () only.

MaKE
The number of chickens is 20 less than the number of rabbits, and the number of feet for chickens is 10 more than that of rabbits. How many chickens and rabbits are there?

CVAE
In the average daily None dormitory of the fifth grade, Teacher Chen gave four students, two small dormitories can accommodate 22 students. Two groups can accommodate 22 students, and the residence is just so big that the dormitory can accommodate 192 students. There are () rooms in the dormitory group. There are 14 students in the small dormitory, which just happened to be full.

MAGNET
There are 16 people in HongKong. The large dormitory can accommodate 22 people, and the small dormitory can accommodate 14 people. A total of 192 people, how many large dormitory?

UniLM
The school arranges accommodation for students. The big dormitory can accommodate 22 students in the dormitory, and the small dormitory can accommodate students. A total of 192 students can live in the dormitory. On average, how many rooms are there in each small dormitory?

Transformer
192 students from grade 1 to grade 6 go to the spring trip. There are 16 teachers in total. Students have two kinds of dormitories to choose. The large dormitory can live in 22 people, and the small dormitory can live in 14 people. In total, 192 dormitories are booked. What is the number for the booked small dormitory and large dormitory separately?

MaKE
The Youth Hostel is designed to accommodate 22 people in the large dormitory and 14 people in the small dormitory. One day, there were 192 travelers and the hostel was just about full with an average of 16 people per dormitory. How many large dormitories are there in the youth hostel?

Table 9: Illustrative examples of the MWP generation comparison with unseen equations. () represents the question that the student needs to solve.