Guiding the Growth: Difficulty-Controllable Question Generation through Step-by-Step Rewriting

This paper explores the task of Difficulty-Controllable Question Generation (DCQG), which aims at generating questions with required difficulty levels. Previous research on this task mainly defines the difficulty of a question as whether it can be correctly answered by a Question Answering (QA) system, lacking interpretability and controllability. In our work, we redefine question difficulty as the number of inference steps required to answer it and argue that Question Generation (QG) systems should have stronger control over the logic of generated questions. To this end, we propose a novel framework that progressively increases question difficulty through step-by-step rewriting under the guidance of an extracted reasoning chain. A dataset is automatically constructed to facilitate the research, on which extensive experiments are conducted to test the performance of our method.


Introduction
The task of Difficulty-Controllable Question Generation (DCQG) aims at generating questions with required difficulty levels and has recently attracted researchers' attention due to its wide applications, such as facilitating curriculum-learning-based training of QA systems (Sachan and Xing, 2016) and designing exams of various difficulty levels for educational purposes (Kurdi et al., 2020).
While previous QG works have controlled the interrogative word (Zi et al., 2019; Kang et al., 2019) or the context of a question (Liu et al., 2019a, 2020), few have addressed difficulty control, as it is hard to formally define the difficulty of a question. To the best of our knowledge, only one previous work studies DCQG for free text, defining question difficulty as whether a QA model can correctly answer the question.
This definition yields only two difficulty levels and is largely empirically driven, offering little insight into what makes a question difficult or how difficulty varies.
In this work, we redefine the difficulty level of a question as the number of inference steps required to answer it, which reflects the requirements on reasoning and cognitive abilities (Pan et al., 2019). Existing QA systems perform substantially worse in answering multi-hop questions than single-hop ones (Yang et al., 2018), also supporting the soundness of using reasoning hops to define difficulty.
To achieve DCQG with the above definition, a QG model should have strong control over the logic and reasoning complexity of generated questions. Graph-based methods are well suited for such logic modelling (Pearl and Paz, 1986; Zhang et al., 2020). In previous QG research, Yu et al. (2020) and Pan et al. (2020) implemented graph-to-sequence frameworks to distill the inner structure of the context, but they mainly used graphs to enhance document representations, rather than to control the reasoning complexity of the generated questions.
In this paper, we propose a highly controllable QG framework that progressively increases the difficulty of generated questions through step-by-step rewriting. Specifically, we first transform a given raw text into a context graph, from which we sample the answer and the reasoning chain for the generated question. Then, we design a question generator and a question rewriter to generate an initial simple question and rewrite it step by step into more complex ones. As shown in Fig. 1, "Tom Cruise" is the selected answer, and Q 1 is the initial question, which is then adapted into Q 2 by adding one more inference step (i.e., N 1 ← N 2 ) in the reasoning chain. That is, answering Q 2 additionally requires inferring that "Top Gun" is "the film directed by Tony Scott" before answering Q 1 . Similarly, we can further increase the difficulty level and extend the question step by step into more difficult ones (i.e., Q 3 , Q 4 and Q 5 ).
To train our DCQG framework, we design effective strategies to automatically construct the training data from existing QA datasets, instead of building one from scratch with intensive human effort. Specifically, we utilize HotpotQA (Yang et al., 2018), a QA dataset where most questions require two inference steps to answer and can be decomposed into two 1-hop questions. In this way, we obtain a dataset that contains 2-hop questions and their corresponding 1-hop reasoning steps. Having learned how to rewrite 1-hop questions into 2-hop ones with this dataset, our framework can easily extend to generating (n+1)-hop questions from n-hop ones with only a small amount of corresponding data, because the rewriting operation follows fairly fixed patterns regardless of the exact value of n, as shown in Fig. 1.
Extensive evaluations show that our method can controllably generate questions with required difficulty, and keep competitive question quality at the same time, compared with a set of strong baselines.
In summary, our contributions are as follows:
• To the best of our knowledge, this is the first work of difficulty-controllable question generation with question difficulty defined as the number of inference steps required to answer the question;
• We propose a novel framework that achieves DCQG through step-by-step rewriting under the guidance of an extracted reasoning chain;
• We build a dataset that facilitates training of rewriting questions into more complex ones, paired with constructed context graphs and the underlying reasoning chains of the questions.

Related Work
Deep Question Generation Most previous QG research (Zhou et al., 2017; Pan et al., 2019; Liu et al., 2020) focused on generating single-hop questions like those in SQuAD (Rajpurkar et al., 2016). In the hope that AI systems could provoke more in-depth interaction with humans, deep question generation aims at generating questions that require deep reasoning. Many recent works attempted this task with graph-based neural architectures. Talmor and Berant (2018) and Kumar et al. (2019) generated complex questions based on knowledge graphs, but their methods cannot be directly applied to QG for free text, which lacks a clear logical structure.
In sequential question generation, Chai and Wan (2020) used a dual-graph interaction to better capture context dependency. However, they considered all the tokens as nodes, which led to a very complex graph. Yu et al. (2020) tried to generate multi-hop questions from free text with the help of entity graphs constructed by external tools. Our work shares a similar setting with Yu et al. (2020), and we further explore the problem of how to generate deep questions in a more controllable paradigm.

Difficulty-Controllable Question Generation
DCQG is a relatively new task. The only previous DCQG work for free text classified questions as easy or hard according to whether they could be correctly answered by a BERT-based QA model, and controlled the question difficulty by modifying the hidden states before decoding. Another line of research on QG for knowledge graphs (Kumar et al., 2019) estimated question difficulty based on the popularity of the named entity, and manipulated the generation process by incorporating the difficulty level into the input embedding of the Transformer-based decoder. In our work, we control the question difficulty based on the number of its reasoning hops, which is more explainable.
Question Rewriting Question rewriting is another emerging trend in recent research, demonstrating benefits to both QG and QA tasks. With rewriting, QG models can produce more complex questions by incorporating more context information into simple questions (Elgohary et al., 2019; Vakulenko et al., 2020), and QA pipelines can decompose an original complex question into multiple simpler questions to improve model performance (Khot et al., 2020).

Method
Given input context text C and a specific difficulty level d, our objective is to generate a (question, answer) pair (Q, A), where A is a sub-span of C and Q requires d-hop reasoning to answer. Fig. 2 and Algorithm 1 give an overview of our proposed framework. First, we construct a context graph G CG corresponding to the given context, from which a subgraph G T is selected to serve as the reasoning chain of the generated question. Next, with the reasoning chain and other contextual information as input, a question generator (QG Initial ) produces an initial simple question Q 1 . Then, Q 1 is fed to a question rewriting module (QG Rewrite ), which iteratively rewrites it into a more complex question Q i (i = 2, 3, . . . , d).
In what follows, we will introduce the whole generation process in more detail.

Context Graph Construction
We follow the method proposed by Fan et al. (2019) to build the context graph G CG . Specifically, we first apply open information extraction (Stanovsky et al., 2018) to extract (subject, relation, object) triples from context sentences. Each triple is then transformed into two nodes connected by a directed edge, such as A Perfect Murder --is--> a 1998 American crime film in Fig. 2. The two nodes respectively represent the subject and object, and the edge describes their relation. Coreference resolution (Lee et al., 2017) is applied to merge nodes referring to the same entity; for instance, A Perfect Murder is merged with It in Fig. 2.
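The construction step above can be sketched in a few lines of Python. This is a minimal sketch with hand-written triples and coreference clusters; in practice the triples come from an OpenIE system and the clusters from a neural coreference resolver, and the helper shown here is purely illustrative.

```python
def build_context_graph(triples, coref_clusters):
    """Turn (subject, relation, object) triples into a directed, labelled
    graph, merging nodes whose mentions corefer to the same entity."""
    # Map every mention to a canonical name via the coreference clusters.
    canonical = {}
    for cluster in coref_clusters:
        head = cluster[0]                 # take the first mention as canonical
        for mention in cluster:
            canonical[mention] = head

    nodes, edges = set(), []
    for subj, rel, obj in triples:
        s = canonical.get(subj, subj)     # merge coreferent mentions
        o = canonical.get(obj, obj)
        nodes.update([s, o])
        edges.append((s, rel, o))         # directed edge labelled with relation
    return nodes, edges

# Example mirroring Fig. 2: "It" corefers with "A Perfect Murder".
triples = [
    ("A Perfect Murder", "is", "a 1998 American crime film"),
    ("It", "is a modern remake of", "Dial M for Murder"),
]
clusters = [["A Perfect Murder", "It"]]
nodes, edges = build_context_graph(triples, clusters)
```

After merging, the pronoun node "It" disappears and both facts attach to the same A Perfect Murder node, which is what allows a multi-hop reasoning chain to pass through it.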
Reasoning Chain Selection With the context graph constructed, we sample a connected subgraph G T consisting of d + 1 nodes from it to serve as the reasoning chain of the generated question. A node N 0 is first sampled as the answer of the question, provided that it is, or is linked with, a named entity whose node degree is greater than one. Next, we extract from G CG a maximum spanning tree G L with N 0 as its root node, e.g., the tree structure shown in Fig. 1; G CG is temporarily treated as an undirected graph at this step. We then prune G L into G T so that only d + 1 nodes remain. During pruning, we consider the sentence position from which each node is extracted, in order to make the reasoning chain relevant to more of the context. In the following, we denote a node in G T as N i (i = 0, 1, . . . , d), where nodes are subscripted by preorder traversal of G T , and N P (i) denotes the parent of N i .
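The chain-selection step can be illustrated with a simplified sketch: build a BFS spanning tree rooted at the answer node (the graph treated as undirected, as in the paper), then prune to d + 1 nodes. The paper's pruning additionally weighs sentence positions; here we simply keep the first d + 1 nodes in BFS order, so this is an assumption-laden approximation, not the authors' exact procedure.

```python
from collections import deque

def select_reasoning_chain(edges, answer_node, d):
    """Pick a (d+1)-node reasoning chain rooted at the answer node.
    Simplified: BFS spanning tree + keep the first d+1 nodes
    (the paper's sentence-position heuristic is omitted)."""
    adj = {}
    for s, _, o in edges:                     # undirected adjacency
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)

    parent, order = {answer_node: None}, []
    queue = deque([answer_node])
    while queue:
        n = queue.popleft()
        order.append(n)
        for nb in sorted(adj.get(n, ())):     # sorted for determinism
            if nb not in parent:
                parent[nb] = n
                queue.append(nb)

    kept = order[: d + 1]                     # prune to d+1 nodes
    return kept, {n: parent[n] for n in kept}

# Toy graph in the spirit of Fig. 1.
edges = [
    ("Tom Cruise", "starred in", "Top Gun"),
    ("Top Gun", "was directed by", "Tony Scott"),
    ("Top Gun", "was released in", "1986"),
]
chain, parents = select_reasoning_chain(edges, "Tom Cruise", d=2)
```

The returned parent map is what the rewriter later consumes: each node N i together with its parent N P(i) defines one rewriting step.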
Step-by-step Question Generation Our step-by-step QG process is described at lines 5-11 in Algorithm 1. The following notations are defined for clearer illustration:
• Q i (i = 1, 2, . . . , d) represents the question generated at each step, where Q d is the final question Q, and Q i+1 is rewritten from Q i by adding one more hop of reasoning.
• S i represents the context sentence from which the triple containing N i is extracted.
Specifically, we consider two types of rewriting patterns in this work: Bridge and Intersection. As shown in Fig. 1, Bridge-style rewriting replaces an entity with a modifying clause, while Intersection adds another restriction to an existing entity in the question. These two types can be distinguished by whether N i is the first child of its parent node, i.e., whether its parent node has already been rewritten once in Bridge style.
To generate the final question with the required difficulty level d, we first use a question generator QG Initial to generate an initial simple question based on N 1 , N 0 , and the corresponding context sentence S 1 . Then, we repeatedly (for d − 1 times) use QG Rewrite to rewrite question Q i−1 into a more complex one Q i , based on node N i and its parent node N P (i) , context sentence S i , and the rewriting type R i (i = 2, 3, . . . , d). Formally, the generation and rewriting processes can be written as:

Q 1 = QG Initial (N 1 , N 0 , S 1 ),
Q i = QG Rewrite (Q i−1 , N i , N P (i) , S i , R i ), i = 2, 3, . . . , d.

In our implementation, both QG Initial and QG Rewrite are initialized with the pre-trained GPT2-small model (Radford et al., 2019), and then fine-tuned on our constructed dataset (see Sec. 4). The encoder of QG Rewrite , as illustrated in Fig. 2, is similar to that of Liu et al. (2020). If N i points to N P (i) , the input sequence is organized as the concatenation of the answer N 0 , the child node N i (marked with the special token nodeC), the parent node N P (i) (marked with nodeP), the context sentence S i , the rewriting type R i (marked with type), and the previous question Q i−1 (marked with subq). The positions of the nodeC and nodeP segments are exchanged if N P (i) points to N i . The input of QG Initial is organized in the same way, except without the type and subq segments. A segment embedding layer is utilized to identify the different segments. For those parts of S i and Q i−1 that are the same as, or refer to the same entity as, N P (i) , we replace their segment embeddings with that of N P (i) , considering that the parent node of N i plays an important role in denoting what to ask about, or which part to rewrite, as shown in Fig. 1.
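The input-sequence assembly for QG Rewrite can be sketched as below. The exact special-token strings (e.g., an answer or sentence marker) are assumptions of this sketch; the paper specifies the segment ordering and the nodeC/nodeP swap rule, but not every token name.

```python
def build_rewrite_input(answer, node_child, node_parent, sentence,
                        rewrite_type, prev_question, child_to_parent=True):
    """Assemble the flat input sequence fed to the GPT2-based rewriter.
    Token names <ans> and <sent> are illustrative assumptions; nodeC,
    nodeP, type and subq follow the paper's description."""
    if child_to_parent:   # N_i points to N_P(i): child segment comes first
        nodes = f"<nodeC> {node_child} <nodeP> {node_parent}"
    else:                 # N_P(i) points to N_i: the two segments swap
        nodes = f"<nodeP> {node_parent} <nodeC> {node_child}"
    return (f"<ans> {answer} {nodes} <sent> {sentence} "
            f"<type> {rewrite_type} <subq> {prev_question}")

seq = build_rewrite_input(
    answer="Alfred Hitchcock",
    node_child="A Perfect Murder",
    node_parent="Dial M for Murder",
    sentence="A Perfect Murder is a modern remake of Dial M for Murder.",
    rewrite_type="bridge",
    prev_question="Who directed Dial M for Murder?",
)
```

For QG Initial the same assembly is used without the `<type>` and `<subq>` segments, matching the description above.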

Automatic Dataset Construction
Manually constructing a new dataset for our task is difficult and costly. Instead, we propose to automatically build a dataset from existing QA datasets without extra human annotation. In our work, the training data is constructed from HotpotQA (Yang et al., 2018), in which every context C consists of two paragraphs {P 1 , P 2 }, and most of the questions require two hops of reasoning, each concerning one paragraph. HotpotQA also annotates supporting facts F, which are the part of the context most relevant to the question. In addition to the information already available in HotpotQA, we also need the following information to train QG Initial and QG Rewrite : i) (Q 1 , A 1 ), the simple initial question and its answer, which are used to train QG Initial ; ii) R 2 , the type of rewriting from Q 1 to Q 2 ; iii) {N 0 , N 1 , N 2 }, the reasoning chain of Q 2 ; and iv) S i (i = 1, 2), the context sentences where we extract N 0 , N 1 and N 2 .
Algorithm 2 describes our procedure to obtain the above information. The construction process is facilitated by a reasoning type classifier (TypeClassify) and a question decomposer (DecompQ), built following previous work on question decomposition. For each question in HotpotQA (i.e., Q 2 ), we first identify its reasoning type and filter out those that are neither Bridge nor Intersection. The reasoning type here corresponds to the rewriting type R i . Then, DecompQ decomposes Q 2 into two sub-questions, subq 1 and subq 2 , based on span prediction and linguistic rules. For example, the Q 2 in Fig. 2 is decomposed into subq 1 = "To which film A Perfect Murder was a modern remake?" and subq 2 = "Who directed Dial M for Murder?". After that, an off-the-shelf single-hop QA model is utilized to acquire the answers of the two sub-questions, which are "Dial M for Murder" and "Alfred Hitchcock" in the example.
As for Q 1 , it is one of the two sub-questions. When Q 2 is of the Intersection type, Q 1 can be either subq 1 or subq 2 . For the Bridge type, Q 1 is the sub-question whose answer is the same as A 2 ; in the example above, Q 1 is subq 2 because the answer of subq 2 equals A 2 . The context sentence S i is supposed to provide the supporting facts F contained in the paragraph that concerns Q i (i = 1, 2). As for the reasoning chain, it is selected from the local context graph by first locating N 2 and then finding N 0 and N 1 through text matching with the two sub-questions.
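The rule for picking Q 1 among the two sub-questions can be sketched directly from the description above; the function below is illustrative, not the authors' code.

```python
def pick_initial_question(subqs, answers, final_answer, reasoning_type):
    """Choose Q1 among the two sub-questions (Sec. 4):
    - Intersection: either sub-question works, so take the first;
    - Bridge: Q1 is the sub-question whose answer equals A2."""
    if reasoning_type == "intersection":
        return subqs[0], answers[0]
    for q, a in zip(subqs, answers):
        if a == final_answer:
            return q, a
    raise ValueError("no sub-question shares the final answer")

# The paper's running example (Bridge type, A2 = "Alfred Hitchcock").
q1, a1 = pick_initial_question(
    subqs=["To which film A Perfect Murder was a modern remake?",
           "Who directed Dial M for Murder?"],
    answers=["Dial M for Murder", "Alfred Hitchcock"],
    final_answer="Alfred Hitchcock",
    reasoning_type="bridge",
)
```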

Experiments
In the following experiments, we mainly evaluate the generation results of our proposed method when required to produce 1-hop and 2-hop questions, denoted as Ours 1-hop and Ours 2-hop . In Sec. 5.2, we compare our method with a set of strong baselines using both automatic and human evaluations on question quality. In Sec. 5.3, we provide controllability analysis by manually evaluating their difficulty levels and testing the performance of QA systems in answering questions generated by different methods. In Sec. 5.4, we test the effect of our generated QA pairs on the performance of a multi-hop QA model in a data augmentation setting.
In Sec. 5.5, we further analyze the extensibility of our method, i.e., its potential in generating questions that require reasoning of more than two hops. Our code and constructed dataset have been made publicly available to facilitate future research.

Experimental Setup
Datasets The constructed dataset described in Sec. 4 consists of 57,397/6,072/6,072 samples for training/validation/test. For context graph construction, we use the coreference resolution toolkit from AllenNLP 1.0.0 (Lee et al., 2017).
Baselines We compare against the following methods:
• Two sequence-to-sequence QG baselines from prior work; the second separately encodes the answer and the context.
• SRL-Graph and DP-Graph (Pan et al., 2020), two state-of-the-art QG systems. They encode graph-level and document-level information with an attention-based Graph Neural Network (GNN) and a bi-directional GRU, respectively. SRL-Graph constructs the semantic graph by semantic role labelling, and DP-Graph by dependency parsing.
• GPT2, a vanilla GPT2-based QG model. Its input is the concatenation of the context and the sampled answer; the position where the answer appears in the context segment is denoted in the segment embedding layer.

Implementation Details
The baseline models are trained to directly produce 2-hop questions, while QG Initial and QG Rewrite are respectively trained to generate 1-hop questions and to rewrite 1-hop questions into 2-hop ones. QG Initial , QG Rewrite , and GPT2 are initialized with the GPT2-small model from the HuggingFace Transformers library, and fine-tuned for 8, 10, and 7 epochs, respectively, with a batch size of 16. We apply top-p nucleus sampling with p = 0.9 during decoding. AdamW (Loshchilov and Hutter, 2017) is used as the optimizer, with the initial learning rate set to 6.25×10 −5 and adaptively decayed during training. For DP-Graph, we use the released model and code to perform the experiment. For the other three baselines, we directly refer to the experiment results reported in Pan et al. (2020). The performances of these baselines are compared under the same setting as in Pan et al. (2020), where each context is abbreviated to include only the supporting facts and the part that overlaps with the question. More implementation details can be found in our code and the supplementary materials.
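For readers unfamiliar with top-p (nucleus) sampling, a minimal dependency-free sketch of the decoding rule is shown below: keep the smallest set of tokens whose cumulative probability exceeds p, renormalise, and sample from that set. Real implementations (e.g., in the Transformers library) operate on logit tensors; this version works on a plain token-to-logit dict for clarity.

```python
import math
import random

def top_p_sample(logits, p=0.9, rng=random):
    """Sample one token with nucleus (top-p) sampling.
    `logits` maps token -> unnormalised score."""
    # Softmax over the logits.
    m = max(logits.values())
    probs = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(probs.values())
    probs = {t: v / z for t, v in probs.items()}

    # Smallest set of tokens whose cumulative probability reaches p.
    kept, cum = [], 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break

    # Renormalise over the nucleus and sample.
    total = sum(pr for _, pr in kept)
    r, acc = rng.random() * total, 0.0
    for tok, pr in kept:
        acc += pr
        if acc >= r:
            return tok
    return kept[-1][0]
```

With a strongly peaked distribution the nucleus collapses to the single most likely token, which is why top-p decoding stays fluent while still allowing diversity on flatter distributions.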

Evaluation of Question Quality
Automatic Evaluation The automatic evaluation metrics are BLEU3, BLEU4 (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and CIDEr (Vedantam et al., 2015), which measure the similarity between the generation results and the reference questions in terms of n-grams. As the four baselines are trained to generate 2-hop questions only, we compare them only with Ours 2-hop . As shown in Table 1, Ours 2-hop and GPT2 perform consistently better than the others. Though the performances of Ours 2-hop and GPT2 are close in terms of automatic metrics, we observe that the questions generated by Ours 2-hop are usually more well-formed, concise, and answerable, as illustrated in Table 2. These advantages cannot be reflected by automatic evaluation.

Human Evaluation
Annotators are asked to evaluate the generated questions on four criteria:
• Well-formed: whether a question is semantically correct. Annotators mark a question as yes, acceptable, or no; acceptable is chosen if the question is not grammatically correct but its meaning is still inferable.
• Concise: whether the QG models are overfitted, generating questions with redundant modifiers. A question is marked as yes if no single word can be deleted, acceptable if it is a little lengthy but still natural, and no if it is abnormally verbose.
• Answerable: whether a question is answerable according to the given context. The annotation is either yes or no.
• Answer Matching: whether the given answer is the correct answer to the question. The annotation is either yes or no.
The results are shown in Table 3. Overall, Ours 2-hop performs consistently better than DP-Graph and GPT2 across all metrics, and is comparable to the hand-crafted reference questions. Our method performs especially well in terms of conciseness, even better than the reference questions: the average word counts of the questions generated by DP-Graph, GPT2, Ours 2-hop , and Gold 2-hop are 19.32, 19.26, 17.18, and 17.44, respectively. This demonstrates that the enriched graph information and our multi-stage rewriting mechanism indeed improve the question structure and content. In comparison, we find that the questions generated by the two baselines tend to pile up unreasonably many modifiers and subordinate clauses. As for the 1-hop questions, Ours 1-hop performs well in terms of answerable and answer matching, but is less competitive in terms of well-formed, mainly due to the limitation of its training data: as the 1-hop reference questions (Gold 1-hop ) are automatically decomposed from the hand-crafted 2-hop questions, a significant portion (44%) of them contain grammatical errors, though most remain understandable.

Controllability Analysis
Human Evaluation of Controllability For controllability analysis, we manually evaluate the numbers of inference steps involved in generated questions. DP-Graph and GPT2 are also evaluated for comparison. The results are shown in Table 4. 70.65% of Ours 1-hop require one step of inference and 67.74% of Ours 2-hop require two steps, proving that our framework can successfully control the number of inference steps of most generated questions. In comparison, DP-Graph and GPT2 are not difficulty-aware and their generated questions are more scattered in difficulty levels.
Difficulty Assessment with QA Systems For further assessment of question difficulty, we test the performance of QA models in answering questions generated by different models. Specifically, we utilize two off-the-shelf QA models provided by the HuggingFace Transformers library, which are respectively initialized with BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b), and then fine-tuned on SQuAD (Rajpurkar et al., 2016). We select those generated questions that are verified to be paired with correct answers by the human evaluation described in Sec. 5.2, and test the performance of the two QA models in answering them. The evaluation metrics are Exact Match (EM) and F1. The results are shown in Table 5. We can see that questions generated by Ours 2-hop are more difficult than those of Ours 1-hop not only for humans (requiring more hops of reasoning), but also for state-of-the-art QA models. In comparison, with a more scattered mix of 1-hop and 2-hop questions, the QA performances on DP-Graph and GPT2 fall between those on Ours 1-hop and Ours 2-hop . This result demonstrates that our method can controllably generate questions of different difficulty levels for QA systems, and that inference steps effectively model question difficulty.
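The EM and F1 metrics used here follow the standard SQuAD evaluation: answers are normalised (lowercased, punctuation and articles stripped), then compared exactly (EM) or by token overlap (F1). A simplified sketch of that computation:

```python
import re
import string
from collections import Counter

def normalize(s):
    """SQuAD-style answer normalisation: lowercase, drop punctuation,
    drop articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_f1(prediction, gold):
    """Return (exact match, token-level F1) for one prediction."""
    pred, ref = normalize(prediction), normalize(gold)
    em = float(pred == ref)
    pred_toks, ref_toks = pred.split(), ref.split()
    common = Counter(pred_toks) & Counter(ref_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return em, 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(ref_toks)
    return em, 2 * precision * recall / (precision + recall)
```

Normalisation makes "the Top Gun" and "Top Gun" count as an exact match, so EM measures span identity rather than surface-string identity.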

Boosting Multi-hop QA Performance
We further evaluate whether the generated QA pairs can boost QA performance through data augmentation. Specifically, we heuristically sample answers and reasoning chains from the context graphs in our constructed dataset to generate 150,305 two-hop questions. As a comparison, we utilize GPT2 to generate the same amount of data with the same sampled answers and contextual sentences. Low-quality questions are filtered out if their word counts are not between 6 and 30 (4.7% for ours and 9.2% for GPT2), or if the answer directly appears in the question (2.7% for ours and 2.4% for GPT2). Finally, we randomly sample 100,000 QA pairs and augment the HotpotQA dataset with them. A DistilBERT-based QA model is implemented, which takes as input the concatenation of context and question to predict the answer span. To speed up the experiment, we only consider the necessary supporting facts as the question-answering context. During training, the original samples from HotpotQA are oversampled to ensure that they are at least 4 times as many as the generated data. We use Adam (Kingma and Ba, 2015) as the optimizer, with a mini-batch size of 32. The learning rate is initially set to 3×10 −5 and adaptively decays during training. The configurations are the same in all the QA experiments, except that the training datasets are different combinations of HotpotQA and the generated data. The validation and test sets are the same as those of HotpotQA.
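The two filtering heuristics described above are simple enough to state as code; the function below is a sketch of them, with the thresholds taken from the text.

```python
def keep_qa_pair(question, answer, min_len=6, max_len=30):
    """Filter heuristics from the data-augmentation setup:
    drop questions whose word count is outside [6, 30], or whose
    answer string appears verbatim in the question."""
    n = len(question.split())
    if not (min_len <= n <= max_len):
        return False
    if answer.lower() in question.lower():
        return False           # answer leakage makes the pair trivial
    return True
```

The answer-leakage check matters because a question that literally contains its answer teaches the QA model nothing about reasoning over the context.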
We test the impact of the generated data under both a high-resource setting (using the whole training set of HotpotQA) and a low-resource setting (using only 25% of the data, randomly sampled from HotpotQA). Fig. 3 compares the QA performance when augmented with different quantities of the data generated by our method and by GPT2, respectively. Under both settings, our method achieves better performance than GPT2. Under the low-resource setting, the performance boost achieved by our generated data is more significant and clearly better than that of GPT2. The performance of the QA model steadily improves as the training dataset is augmented with more data; EM and F1 are improved by 2.56% and 1.69%, respectively, when 100,000 samples of our generated data are utilized.

More-hop Question Generation
To analyze the extensibility of our method, we experiment with the generation of questions that are more than 2-hop, by repeatedly using QG Rewrite to increase question difficulty. Fig. 4 shows two examples of 3-hop question generation process. The two intermediate questions and the corresponding reasoning chains are also listed for reference.
We can see that the intermediate questions, serving as springboards, are effectively used by QG Rewrite to generate more complex questions. With the training data that only contains 1-hop and 2-hop questions, our framework is able to generate some high-quality 3-hop questions, demonstrating the extensibility of our framework. It can be expected that the performance of our model can be further strengthened if a small training set of 3-hop question data is available.
Besides, it can also be observed that though the contexts and answers of these two questions are the same, two different questions with different underlying logic are generated, illustrating that the extracted reasoning chain effectively controls the question content.
However, when generating questions with more than three hops, we find that question quality drastically declines: semantic errors become more frequent, and some content tends to be unreasonably repeated. This is probably because the input of QG Rewrite , which grows with the question length, becomes too long to be precisely encoded by the GPT2-small model. Effectively extending our method to more-hop question generation will be our future work.

Conclusion
We explored the task of difficulty-controllable question generation, with question difficulty redefined as the inference steps required to answer it. A step-by-step generation framework was proposed to accomplish this objective, with an input sampler to extract the reasoning chain, a question generator to produce a simple question, and a question rewriter to further adapt it into a more complex one. A dataset was automatically constructed based on HotpotQA to facilitate the research. Extensive evaluations demonstrated that our method can effectively control difficulty of the generated questions, and keep high question quality at the same time.