Automatic Generation of Socratic Subquestions for Teaching Math Word Problems

Socratic questioning is an educational method in which students discover answers to complex problems through a series of thoughtful questions. Generating didactically sound questions is challenging, as it requires understanding the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance, but also assist math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) to generate sequential questions for guiding math word problem-solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver. We conduct a preliminary user study to examine the potential value of such question generation models in the education domain. Results suggest that the difficulty level of problems plays an important role in determining whether questioning improves or hinders human performance. We discuss the future of using such questioning strategies in education.


Introduction
Questioning can be a valuable way of supporting student thinking. It can be conceived as a scaffold (Wood et al., 1976; Quintana et al., 2004), where a more knowledgeable tutor helps a student solve problems that would otherwise be too difficult. One approach well-suited for mathematics is funneling (Wood, 1994), which uses prompting questions to guide students towards a solution.
Figure 1: Math word problems can be procedurally solved in multiple reasoning steps. One operationalization of Socratic questioning is to map each step in the procedure to a question. Asking (machines/humans) the right set of questions in a certain sequence (shown in green) can be an effective way to do so.

To be effective, Socratic questioning should be focused and goal-driven. Figure 1 shows an example of a math word problem where this questioning strategy might be beneficial. We hypothesize that these questions can not only help humans understand the problem better and improve their performance, but can also assist MWP solvers.
Even though question generation (QG) models have been studied for factual SQuAD-like questions (Rajpurkar et al., 2016; Puri et al., 2020), these models fail to generate sequentially coherent questions (Reddy et al., 2019; Choi et al., 2018). Furthermore, domain-specific questioning is challenging, as the QG model needs to understand the reasoning process required to provide fine-grained responses. Moreover, the role of a teacher using questioning is to interject questions that focus on the most critical points in an explanation and take the understanding forward (Anghileri, 2006). As seen in bold in Figure 1, we later refer to these properties of questioning as focused and goal-driven.
In this work, we explore the use of large language models (Raffel et al., 2020; Radford et al., 2019) to generate guiding sub-questions for math word problems. In particular, we use reinforcement learning (RL) with rewards from various sources, including math question answering (Math QA) models, and various forms of input conditioning for generating these questions. We train and evaluate our models on the recently released GSM8K MathQA dataset (Cobbe et al., 2021) of multi-step reasoning MWPs. We illustrate the benefit of our RL-based generation strategy using both automatic and human evaluation metrics. Our evaluation shows that our guided approach makes the generation model ask more logically relevant and structurally correct questions, which follow the appropriate sequencing of questioning at the right granularity level.
We further show that our generated questions, when provided as additional context, can aid a math question answering model, thereby providing further empirical justification of the value of questioning for math QA model training. Questioning could facilitate the reasoning of MWP solvers by making intermediate reasoning steps explicit. Finally, we explore the didactic usefulness of our questioning strategy by conducting a preliminary user study, which shows that the generated sequence of questions may have the potential to improve students' problem-solving. However, we cautiously note that achieving this would require further progress on many fronts in AI and education. In what follows, we begin by discussing related work and introducing our research questions in section 2 and section 3. We propose ways to induce these properties in LMs using planning and reinforcement learning in section 4; section 5 empirically demonstrates the effectiveness of inducing the questioning strategy in LMs and the quality of generated questions, evaluated using automatic metrics and by humans. Finally, we evaluate the potential of using such questions as an educational tool for helping students solve MWPs in section 6.

Related Work
Socratic questioning approaches have evolved within the learning sciences community into the theory of scaffolding (Wood et al., 1976; Reiser, 2004), which broadly refers to assisting students in problem-solving beyond their zone of proximal development (Quintana et al., 2004). Computer-based scaffolds (e.g., in the form of hints, prompts, feedback) have moderate effects on student learning outcomes (Kim et al., 2018), and our work can be used to automatically generate such scaffolds in the form of questioning prompts. For mathematics, Wood (1994) analyzed interactions in math classrooms and proposed two distinct interaction patterns: funneling, which guides students using leading/prompting questions towards a predetermined solution procedure, and focusing, which draws student attention to the critical aspects of the problem. We draw inspiration from this strand of work. Our overall question generation approach can be conceived as similar to funneling, with specific sub-questions focusing on the important domain concepts.
Research on question generation includes visual question generation (Fan et al., 2018; Wang et al., 2022), generation of questions for student assessment (Stasaski and Hearst, 2017; Wang et al., 2018), generation of factual questions based on Wikipedia articles (Rajpurkar et al., 2016; Ko et al., 2020), and generation of sequential information-seeking questions in dialogue-based scenarios (Reddy et al., 2019; Choi et al., 2018). Other work has explored similar ideas of improving answerability by question-asking (Klein and Nabi, 2019; Shwartz et al., 2020; Perez et al., 2020; Pan et al., 2021) and ranking questions (Rao and Daumé III, 2018). However, factual questions do not usually require much reasoning and mostly boil down to information retrieval from text. In this work, we focus on question generation for reasoning problems.
Prior work on guided and controlled question generation uses either entities as the guiding mechanism (Huang et al., 2021) or a reinforcement learning-based graph-to-sequence approach (Chen et al., 2019). Identification of entities and relationships present in text often relies on rule-based or off-the-shelf extraction tools, which are hard to extend (Dhingra et al., 2020). Often these single-hop questions are combined to form a multi-hop question that requires complex reasoning to solve (Pan et al., 2021). Controllable text generation has been studied for general text generation (Hu et al., 2017; Miladinović et al., 2022; Carlsson et al., 2022), Wikipedia texts (Liu et al., 2018; Prabhumoye et al., 2018) and data-to-text generation (Puduppully and Lapata, 2021; Su et al., 2021). Controlled text generation is particularly useful for ensuring that the information is correct and that numbers are handled properly (Gong et al., 2020). Our task has similar requirements.
A final strand of related work concerns math problem solvers (Hosseini et al., 2014; Kushman et al., 2014; Roy et al., 2015; Seo et al., 2015; Sachan and Xing, 2017; Sachan et al., 2017, 2018, inter alia). Recent work in this area uses specialized architectures such as graph-based encoders (Zhang et al., 2020) and tree-based decoders (Xie and Sun, 2019), and, more recently, large pre-trained LMs, which show state-of-the-art results (Cobbe et al., 2021; Shen et al., 2021; Kojima et al., 2022; Wei et al., 2022; Chowdhery et al., 2022). Application of these approaches to MWP datasets like GSM8K (our data context) still leaves considerable room for improvement, primarily in reasoning capabilities: the majority of the latest approaches are still unable to solve many of the problems and are sensitive to even the slightest modifications of the problem (Patel et al., 2021; Stolfo et al., 2022; Srivastava et al., 2022).

Research Questions
We now discuss the usefulness of questions in solving a math word problem and then study the different properties of a good questioning strategy.
RQ1: Does sub-questioning help in understanding a math word problem better? Question prompts as a teaching strategy act as instructions that guide students throughout a problem-solving process (Wood, 1994). Such questioning, as a valid scaffolding strategy (Kim et al., 2018), is valuable in supporting student thinking and is commonplace in high-quality math instruction (Boston and Candela, 2018). We explored the sub-questioning strategy with our trained NLP model and found that sub-questioning helps answer the MWPs more effectively (Table 1). Experiments with NLP models and humans establish the usefulness of sub-questioning in solving MWPs.
RQ2: What are the properties of a good questioning strategy? Once we established that sub-questioning is helpful, we performed the same sub-questioning experiment as in RQ1 with NLP models, but with permuted ordering of sub-questions, changed granularity of sub-questions, or changed content (Table 2). We observed a decrease in the answering capabilities of the QA model in all cases, establishing that the right sequence of disciplined questions with relevant content is an essential component of a good questioning strategy. Based on our results and inspired by prior work (Wood, 1994; Anghileri, 2006), we hypothesize that the most important components of a Socratic questioning strategy are: (A) Focused: An essential property of a good questioning strategy is to ask questions that are directed towards the most critical domain-specific content. Irrelevant questions not only make the process difficult but also force a diversion in focus and may increase the cognitive load that a student experiences.
(B) Goal-driven: Asking the right sequence of relevant questions that can assist students in reaching the final goal (solving the main question in case of math word problems) is a further important part of good questioning.

Methodology
We discuss our approach to modeling Socratic questioning using large LMs. We begin by defining our MWP dataset D as a collection of MWPs. Each MWP P in the dataset is accompanied by its solution S and the numerical answer A. We do not always assume the existence of problem solutions S and answers A, as they can be automatically derived from various MathQA models. Each MWP P = (C, Q) consists of the story context C and the question Q. The problem solution S consists of n solution steps S = (s_1, ..., s_n). We define Socratic questioning such that each solution step s_i can be mapped to a sub-question q_i. We refer to q as the collection of all Socratic questions q_1, ..., q_n for a given MWP P. An example MWP is shown in Figure 2.
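As a concrete illustration of this setup, the per-problem structure can be sketched as follows (field names are our own, not the dataset's actual schema; the example problem is adapted from the appendix):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MWP:
    """One math word problem P = (C, Q) with its gold annotations."""
    context: str                # story context C
    question: str               # main question Q
    solution_steps: List[str]   # solution steps s_1, ..., s_n
    sub_questions: List[str]    # Socratic questions q_1, ..., q_n
    answer: str                 # numerical answer A

# Socratic questioning maps each solution step s_i to a sub-question q_i,
# so the two lists are aligned one-to-one.
p = MWP(
    context="John has 10 hectares of a pineapple field. "
            "There are 100 pineapples per hectare.",
    question="How many pineapples does John have in total?",
    solution_steps=["10 * 100 = 1000"],
    sub_questions=["How many pineapples are there on all the hectares?"],
    answer="1000",
)
assert len(p.solution_steps) == len(p.sub_questions)
```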
Our main module is the Question Generator (QG) module, a transformer-based (Vaswani et al., 2017) encoder-decoder model. The QG model takes the reference math word problem P and generates Socratic questions q* that are as close to the true sub-questions q as possible. The learning objective of the QG module is the standard sequence-to-sequence likelihood, where Enc represents the encoder and Dec the decoder of the seq2seq QG model. Note that the sub-questions q_i are decoded word by word in an auto-regressive setting.
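Given the autoregressive, word-by-word decoding described here, a plausible form of this objective (our reconstruction, not a quote from the paper) is the negative log-likelihood of the gold sub-questions:

```latex
\mathcal{L}_{QG} = -\sum_{i=1}^{n} \sum_{t} \log P\!\left(q_{i,t} \,\middle|\, q_{i,<t},\, q_{<i},\, \mathrm{Enc}(P)\right)
```

where $q_{i,t}$ denotes the $t$-th token of sub-question $q_i$, produced by the decoder Dec.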
Next, we propose to inject the two Socratic questioning properties in our QG model as follows:

Focused questions
To learn a sequence of disciplined questions focused on specific reasoning steps in the MWP, it is important to ask the right set of questions. We propose a content planner ψ that serves as a guide for the QG model to ask the right focused questions. In principle, the content planner module can extract any relevant information to assist the QG model, but for the task of math word problems, we restrict it to operators and equations. Our planning strategies are defined as follows.

Operators: Given an MWP P, the content planner learns to identify the operations and operators (e.g., addition, multiplication) involved in the problem. Since the operators play a significant role in a given MWP, the generated operators are used as the guiding signal for the QG model when generating sub-questions.
Equations: Equations contain important information for an MWP as they involve not just the operators but also the quantities involved in the problem. Similar to operators, equations can serve as an important guiding signal for asking more focused questions that lead towards a correct solution.
We use the same seq2seq architecture for the content planner module as for our QG model, with the only difference being that the output comprises a set of equations s*_1, ..., s*_n or just the operators within the equations (instead of the sub-questions). The generated operators/equations are appended to the input MWP P in the encoder of the QG module, yielding the modified focused learning objective L^f_QG (Equation 2). Here, plan denotes the content planner module's output and ⊕ denotes the concatenation operation.
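A plausible form of this modified objective, reconstructed under the assumption that only the encoder input changes relative to the base QG likelihood:

```latex
\mathcal{L}_{QG}^{f} = -\sum_{i=1}^{n} \sum_{t} \log P\!\left(q_{i,t} \,\middle|\, q_{i,<t},\, q_{<i},\, \mathrm{Enc}(\mathrm{plan} \oplus P)\right) \tag{2}
```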

Goal-driven questions
An essential element of a good questioning strategy is to ask goal-driven questions that are not only factually related to the main problem but also eventually help in answering the main question. However, any number of goal-driven questions could be asked for an MWP. Thus, our goal is to optimize the questioning strategy such that it is goal-driven, efficient, and rewarding at each step, making sure that the final goal can be achieved with these individual questions. We induce these properties in our QG model using various rewards that force the model to stay relevant to the problem. These rewards are defined as follows.

Fluency: It is important that the generated sub-questions are easily understandable and fluent in the meaning they represent. Although the QG training objective ensures the syntax and semantics of the generated questions, rewarding the system to stay fluent is necessary to remove repetitions and illogical questions.
Granularity: As solving an MWP usually involves multiple reasoning steps, asking a relevant question at each step can help in solving the MWP. Moreover, our questioning strategy is based on the premise that the questions are organised, structured and follow a sequence. With the granularity reward, the model can learn to ask the right number of questions (compared to the number of reasoning steps needed to solve the MWP) in a specific sequence and refrain from unstructured questions.
Answerability: For every generated question, it is important to evaluate whether it can be answered given the context C and whether it helps in answering the overall MWP. We trained an external QA model that answers MWPs with the help of the sub-questions and evaluated whether the generated questions assist in answering the main problem. The answerability reward is provided both on a step-by-step basis (if the QA model can answer a sub-part of the main problem) and overall (whether the final answer was correct when using all sub-questions).
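A minimal sketch of the three reward terms, using stdlib-only stand-ins: the fluency term here is a unigram-precision proxy for the BLEU score the paper actually uses, and the exact functional forms of the granularity and answerability terms are our assumptions, not the paper's.

```python
def fluency_reward(generated: str, reference: str) -> float:
    # Proxy for BLEU-based fluency: fraction of generated tokens that
    # also appear in the reference sub-questions.
    gen = generated.split()
    ref = set(reference.split())
    return sum(tok in ref for tok in gen) / max(len(gen), 1)

def granularity_reward(n_generated: int, n_steps: int) -> float:
    # Penalize asking more or fewer questions than there are reasoning
    # steps; 1.0 when the counts match.
    return 1.0 - abs(n_generated - n_steps) / max(n_generated, n_steps)

def answerability_reward(final_correct: bool,
                         n_correct_substeps: int,
                         n_generated: int) -> float:
    # Overall credit if the QA model reaches the right final answer,
    # plus partial per-step credit for correctly answered sub-questions.
    overall = 1.0 if final_correct else 0.0
    partial = n_correct_substeps / max(n_generated, 1)
    return overall + partial
```

For example, asking exactly as many questions as there are reasoning steps gives `granularity_reward(3, 3) == 1.0`, while asking two questions for four steps gives 0.5.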
During training, the QG model samples a set of sub-questions q′ and calculates the various rewards based on q′. The parameters of the QG model are updated using the REINFORCE algorithm (Williams, 1992). The reward function R(q, q′, P) combines the individual rewards for fluency, granularity and answerability, where BLEU(·,·) denotes the BLEU score (Papineni et al., 2002), and |q| and |q′| denote the number of questions in q and q′, respectively. F(A, A′) = 1 if the final answer from the QA model is correct when it is given the sub-questions q′ alongside the MWP P, and 0 otherwise; A′ denotes the answer from the QA model and A denotes the true answer.
We also evaluated the step-by-step performance of the QA model on the generated sub-questions to check whether the QA model can answer them correctly. This allows us to provide partial rewards at each step of generation. The modified sub-step answerability reward is #a′/|q′|, where #a′ and |q′| denote the number of correctly answered generated sub-questions and the total number of generated questions, respectively.
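Stated in one place, using the symbols defined above, the update rule and reward terms can plausibly be reconstructed as follows (this is our hedged reconstruction; the exact combination weights, if any, are not recoverable from the text):

```latex
\nabla_{\theta}\,\mathcal{L}_{RL}
  = -\;\mathbb{E}_{q' \sim P_{\theta}(\cdot \mid P)}
    \Big[ R(q, q', P)\; \nabla_{\theta} \log P_{\theta}(q' \mid P) \Big],
\qquad
R_{\text{flu}} = \mathrm{BLEU}(q', q),
\quad
R_{\text{ans}} = F(A, A'),
\quad
R_{\text{ans}}^{\text{step}} = \frac{\#a'}{|q'|}
```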

Overall Loss Function
Finally, with the induced Socratic properties in the QG model, the total loss is defined as a combination of the focused learning loss L^f_QG and the reinforcement learning loss L_RL, where α is a weighting factor.
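One natural reading of this combination (the exact form, e.g. whether the weighting is one-sided or convex, is our assumption):

```latex
\mathcal{L} \;=\; \mathcal{L}_{QG}^{f} \;+\; \alpha\, \mathcal{L}_{RL}
```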

Empirical Analysis
We now demonstrate the effectiveness of inducing the defined questioning properties in large LMs.

RQ1: Does sub-questioning help in understanding a math word problem better?

We hypothesize that good questions are focused on the concept to be learnt, asked in the right sequence (ordering), and have high granularity in their structure. We verify our hypothesis with a GPT-2 model as QA solver, after fine-tuning it on the training set of the GSM8K dataset, and with a GPT-3 model with one-shot prompting. Table 1 demonstrates that Socratic questioning improves the performance of the QA solver by as much as 45%. We then vary the properties of the test questions and examine the performance of the QA solver. Table 2 demonstrates that Socratic questions significantly improve the model performance from 5.45% to 10.46%. Sub-questioning even helps when only 75% of the Socratic questions are retained (denoted as {q}_0.75 in the table) or when the order is shuffled (this might be an artefact of the dataset containing a minority of examples with a strict order). An interesting observation is that when the number of Socratic questions is reduced to half or fewer (while preserving their order), the model gets confused and performs worse than with no sub-questions at all. Finally, we take the outputs of a pre-trained T5 model, without fine-tuning it for our task, and use them alongside the problem P as additional information for solving the problem. The performance drops as low as 2.57%, indicating that non-relevant information degrades performance.
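The input manipulations behind these comparisons can be sketched as follows (delimiters and function names are illustrative, not the paper's implementation):

```python
import random

def build_qa_input(problem: str, sub_questions=None) -> str:
    # P ⊕ {q}: append the Socratic sub-questions to the problem text.
    if not sub_questions:
        return problem
    return problem + " " + " ".join(sub_questions)

def subsample_questions(sub_questions, k: float, seed: int = 0):
    # {q}_k: keep a random k-fraction of sub-questions, preserving order.
    rng = random.Random(seed)
    n_keep = max(1, round(k * len(sub_questions)))
    idx = sorted(rng.sample(range(len(sub_questions)), n_keep))
    return [sub_questions[i] for i in idx]

def shuffle_questions(sub_questions, seed: int = 0):
    # shuffle({q}): all sub-questions, in a random order.
    qs = list(sub_questions)
    random.Random(seed).shuffle(qs)
    return qs
```

For instance, `subsample_questions(qs, 0.75)` on a four-question problem keeps three of the four sub-questions in their original relative order.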

RQ2: What are the properties of a good questioning strategy?
We now present our analysis of inducing the two Socratic properties in LMs. Similar to the BLEU score, we achieve better performance on BERT F1 scores too. Finally, the count of correctly generated questions improves with planning, doubling compared to the no-planning variant. However, results show that in all variants the number of generated sub-questions is less than the number of reasoning steps. This could be improved further by oversampling during beam search (beam search settings are the same for all variants in this experiment). The results degrade when the ground-truth content (both equations and operators) is replaced by the output of our content planner module. This is expected, as errors in the content planning module cascade into the generated sub-questions. However, with more powerful models, errors in the content planner can be reduced, leading to improvement in all metrics. See the Appendix for experiments with iterative splitting of the MWP into multiple parts for generation.

Focused generation:
Goal-driven generation: Table 4 summarizes the results for the rewards as a strategy to incentivize the model to generate goal-driven and rewarding questions. We can observe the gains associated with each reward for both the baseline model and the best-performing model from the planning experiments.

Human quality evaluation
Next, we perform a human evaluation of the questions generated for 100 randomly selected test MWPs to assess the quality of our best model's generations compared to the baseline (with no planning or reward-based strategies). For this analysis, we divided the questions among 4 annotators, with an overlap of 40% of the questions among them, to evaluate the generated question quality on the following factors, each rated on a 5-point Likert scale ranging from 1 (poor) to 5 (very good): repetition (whether questions are repeated), factuality (whether all questions can be solved using the information given in the problem), logical relevance (whether the question is logically related to the MWP), right sequence (correct sequence of questions leading to the final answer), granularity (questions are granular enough to solve the problem but are still relevant, and no retrieval or basic common-sense questions are asked), completeness (questions cover all steps needed to reach the final answer), and fluency (grammatical correctness and fluency of language). Figure 3 presents our findings, clearly demonstrating that our planning and reward strategies lead to superior-quality questions on the MWP task. Although both the baseline and our model achieve an almost full score (5) on fluency, our model's questions are better aligned with the MWP, leading to higher scores on all other parameters. We also present a randomly selected sample of generated questions in the Appendix.

Ablation study: Manipulating question properties
Both planning strategies help generate better questions. To gain a deeper understanding of how the content planner ψ affects generated questions, we further analyze the influence of operators as a planning strategy. Here, we randomize the operators and their sequence and measure the change in performance. Table 6 shows that the correct sequence of operators with the correct number of operators guides the generation process better than the randomized versions. The gap between the correct count of operators and a random count indicates that having the correct number of operators (of any type) is more valuable than having the exact types. We observed that the number of operators guides the model in terms of the number of questions that need to be asked, while the type changes the overall quality. Needless to say, for the same number of operators, the operator types still matter.
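A sketch of how such perturbed plans could be constructed, under our reading of the "same #" / "diff #" conditions described in Table 6 (the paper's exact procedure may differ):

```python
import random

OPS = ["+", "-", "*", "/"]

def perturb_plan(operators, mode: str, rng: random.Random):
    # "same #": keep the operator count, randomize the types.
    # "diff #": randomize both the count and the types.
    if mode == "same #":
        return [rng.choice(OPS) for _ in operators]
    if mode == "diff #":
        n = rng.randint(1, 2 * max(len(operators), 1))
        return [rng.choice(OPS) for _ in range(n)]
    return list(operators)  # unperturbed gold plan
```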

A preliminary user study with learners
Finally, we designed a preliminary user study to evaluate whether our generated questions, when presented as further problem-solving exercises (as typically used in educational settings), can help learners on the way to solving the overall problem. Given our research question, we hypothesized that guidance with questions can increase the overall problem-solving success rate for users in the questions (treatment) group compared to the no-questions control group. Our study uses Socratic questions as the main pedagogical intervention. We focus on participants who cannot solve a problem on the first attempt in order to clearly distinguish the impact of automated sub-questioning. The key metric we measure is the success rate, defined as the percentage of correctly solved problems.
For our study, we built a simple user interface which allowed participants to solve math word problems (see Figure 5 and Figure 6 in the appendix). The interface contained a calculator which the users could use if needed. The study comprised 5 pre-test problems and 8 problem-solving exercises, randomly selected from the GSM8K test set. Our user study with this interface was deployed on Mechanical Turk; participants were hired through the platform and were paid $10-12 per hour. We selected participants with moderate levels of prior knowledge using the pre-test scores as the selection criterion, retaining only those scoring in the range of 40-80%. This way, we excluded both low-prior-knowledge participants and experts, to ensure there was a learning possibility.
We randomly split the participants into two groups: a no-questions group (N = 19) with no question prompts, and a questions group (N = 17) with questions generated from our model. Both groups used the same interface for solving math word problems and had the opportunity to re-attempt a problem after the first incorrect submission. The only difference was that after incorrectly solving a problem on the first submission, participants in the questions group saw sub-questions, while those in the no-questions group were only prompted to try again. The sub-questions were generated using the best-performing model with planning and rewards.
The results of the user study are shown in Table 7.
The first-attempt success rate is 58.4% for the control group and 66.0% for the treatment group, which might be the result of a slightly skewed prior-knowledge distribution (0.68 and 0.65 for the treatment and control groups, respectively). Even though participants in the treatment group (M = 124.9, SD = 92.1) spent significantly more time (p < 0.01) solving problems during the second attempt relative to the control group (M = 41.5, SD = 31.4), we did not find a statistically significant difference between the groups in the second-submission success rate (p = 0.659, BF01 = 2.755, Cohen's d = 0.157), indicating weak odds favouring the null hypothesis and a rather small effect size.
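For reference, a pooled-SD Cohen's d can be computed from the reported summary statistics (group sizes are taken from the study design above; whether the paper used the pooled-SD variant is our assumption):

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    # Cohen's d with pooled standard deviation, from summary statistics.
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2)
                       / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# Second-attempt solving time: treatment (N = 17) vs. control (N = 19).
d_time = cohens_d(124.9, 92.1, 17, 41.5, 31.4, 19)
```

This yields a large effect for time-on-task (d ≈ 1.2), consistent with the significant time difference reported, whereas the reported d = 0.157 for the success rate indicates only a small effect.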
As our study was unable to establish overall performance improvements, we further analysed the second-submission success rate per problem (see Figure 4) and correlated it with problem difficulty. This analysis indicated that sub-questioning seems to improve success on simpler problems and degrade accuracy on relatively more complex problems. Prior work has suggested that the effectiveness of question prompts varies with an individual's prior knowledge (Kim et al., 2018), and with insufficient prior knowledge, performance on complex problems may suffer. A post-hoc inspection of the generated sub-questions for the more complex problems shows that they also scored lower in the human quality evaluation. Thus, we hypothesize that for more complex problems, the generated sub-questions are not good enough and may make the task more challenging for participants.
While we were not able to establish any direct benefits of automatic Socratic questioning in a real learning scenario, we leave a more complete user study for future work.Deployment of Socratic questioning systems in real educational scenarios would require a better assessment of question generation quality as well as a better understanding of learners.We believe this is an interesting avenue for future research and encourage future work to attempt to address these issues.

Conclusion
We study the importance of sub-questioning for learning a mathematical concept and explore how LMs may generate these sub-questions. We demonstrate the usefulness of Socratic questioning strategies and propose ways to induce these properties in LMs. We further evaluate whether these questions can assist students in learning domain concepts. We found that the generated questions were generic for each student; if adapted to their prior knowledge and intermediate solutions, their effectiveness could have been greater.

Table 7: User study success rates (in %) before and after the introduction of sub-questions. 1st success is the proportion of exercises solved correctly on the first attempt, and 2nd success is the proportion of exercises solved correctly on the second attempt (out of all those solved incorrectly on the first attempt).

Figure 4: Second submission success rate for problems with at least 10% occurrence in each group (excluding the two simplest problems, 1 and 6). Difficulty level was annotated blind to the correct solution.

A discussion on limitations of our work
Our questioning strategy, although it utilizes information from the content planner and the reward strategy, leaves much to be desired in terms of controllability. Based on our user study, we need to be careful when using the questioning strategy in real educational contexts, as improper content can sometimes do more harm than good. Based on prior work, we focused on two aspects of goodness in questioning in math education; however, this is not a complete list, and other aspects could also be important. We note that our user study was focused on the intermediate success rate rather than on actual learning. From a learning standpoint, asking questions that are always easily answerable will not lead to deeper, wider learning. If learners do not have to struggle to answer the sub-questions being asked and are instead repeating something verbatim or offering a slightly reconfigured version of what they have been asked, they are probably answering sub-questions that do not require conceptual understanding. Another limitation of our work is that our user study was underpowered due to resource constraints, which prevents us from drawing strong conclusions at this point. A larger user study is, however, forthcoming.
Finally, we chose to focus on Socratic questioning in the rather narrow sense of trying to call learners' attention to relevant facts and then implicitly stimulating them to integrate those facts and draw conclusions. However, taken together with all its nuances, the effectiveness of Socratic questioning can be posited to depend also on other critical question types that seek clarification (e.g., can you rephrase?), evidence (e.g., can you provide an example?) and implication (e.g., why do you think ...?) from learners, all of which are truly dialogic and naturally leave room for learner questions. When both the teacher and learners are jointly responsible for pushing the dialogue forward, intermediate success may not always be desirable, as learner errors and misconceptions may offer an important hook for the teacher to nudge the dialogue productively.

A Details of User Study
We perform a user study using Amazon Mechanical Turk. Participants who did not spend a minimum amount of time per question were excluded from the analysis. Generated questions used in the questions group are listed in Table 9. This also explains why #Q for the iterative case is not equal to 1: the model sometimes generates duplicates, and the split is not always perfect.

B.2 GPT-3 prompting
We used one-shot prompting for GPT-3, meaning that we provide one example (Q, A) pair to the model and let it predict the answer A for the next question Q.
No sub-questions Problem: John has 10 hectares of a pineapple field. There are 100 pineapples per hectare. John can harvest his pineapples every 3 months.

Figure 2 :
Figure 2: Our overall methodology: two Socratic properties, focused (red dotted box) and goal-driven (green dotted box) question generation, are added to the question generation model with a combination of content planning and reward-based fine-tuning. Here, ⊕ represents the concatenation operation.

Figure 3 :
Figure 3: Comparison of baseline versus our model-generated sub-questions on several metrics from our human evaluations (showing mean and standard deviation).

Figure 5 :
Figure 5: Interface for our user study (cf. Section 6). For each problem, the first screen contains the MWP text, a calculator, and an input box to submit the answer.

Figure 6 :
Figure 6: After submitting an incorrect solution on the first attempt in the treatment group, our model-generated sub-questions are shown to the participants to guide them through the problem-solving process. The control group only sees a prompt to try again.

Table 1 :
Comparison of Math QA accuracy (in %) with and without Socratic questions for GSM8K test dataset.

Table 2 :
Comparison of Math QA accuracy (in %) for different variations of experiments with ground-truth data. {q}_k denotes that only a fraction k of the ground-truth sub-questions are used, selected randomly; for example, {q}_0.25 means only 25% of the sub-questions are used. shuffle({q}) denotes all sub-questions, but in shuffled order. Finally, <base-ques> are the sub-questions generated by a T5-large model without fine-tuning on our task. (↓) represents the drop in accuracy compared to the Socratic questions (P ⊕ {q}). ⊕ represents the concatenation operation. GPT-2 small was used as the QA model for all the above experiments.

Table 4 :
Goal-driven questions: QG model performance compared to the gold set of ground truth questions with different rewards.
Table 3 compares the two planning strategies. Results demonstrate that the planning strategies improve over the baseline methods by more than 3% in BLEU score with operators as the plan, and by more than 7% with equations.

Table 3
QA performance: We study the impact of the QG model considering both Socratic properties, as shown in Table 5. Sub-questions with operators and equations as planning improve the QA performance by 1-2%. Rewards, although they improve the QG quality, have a negligible effect on QA performance. This is mainly because a slight improvement in sub-question quality does not necessarily help in reaching the final goal.

Table 6 :
Manipulating the planning inputs influences the quality of generated questions and overall QG model performance. same # has the same number of operators as the number of reasoning steps, but the types (+, -, *, /) are shuffled; diff # has both the number and type of operators shuffled.

Table 8 :
QG model performance compared on the gold set of ground-truth test questions with different planning strategies in an iterative setting. In addition to the global strategy of generating all questions given an MWP, we experimented with iterative generation at the sentence level.
Table 9 lists some of the errors we encountered in our question generation strategy.