SkillQG: Learning to Generate Question for Reading Comprehension Assessment

We present $\texttt{SkillQG}$: a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models. Existing question generation systems widely differentiate questions by $\textit{literal}$ information such as question words and answer types to generate semantically relevant questions for a given context. However, they rarely consider the $\textit{comprehension}$ nature of questions, i.e. the different comprehension capabilities embodied by different questions. In contrast, our $\texttt{SkillQG}$ is able to tailor a fine-grained assessment and improvement to the capabilities of question answering models built on it. Specifically, we first frame the comprehension type of questions based on a hierarchical skill-based schema, and then formulate $\texttt{SkillQG}$ as a skill-conditioned question generator. Furthermore, to improve the controllability of generation, we augment the input text with question focus and skill-specific knowledge, which are constructed by iteratively prompting pre-trained language models. Empirical results demonstrate that $\texttt{SkillQG}$ outperforms baselines in terms of quality, relevance, and skill controllability, while showing a promising performance boost in the downstream question answering task.


Introduction
Question generation (QG) systems aim to generate natural language questions conditioned on a text passage. As a dual task of question answering (QA), QG is widely applied to create question-answer pairs as data augmentation for QA training (Zhang and Bansal, 2019; Liu et al., 2020; Chen et al., 2019a) and to help chatbots continue a conversation with human users (Mostafazadeh et al., 2016).

Figure 1: Example context and questions requiring different comprehension skills.
Context: The princess climbed out the window of the high tower and climbed down the south wall when her mother was sleeping. She wandered out a good way. Finally, she went into the forest where there are no electric poles.
Q1: Who climbed out of the castle? A: Princess.
Q2: Why did the princess climb out when her mother was sleeping? A: In case of being caught.
Q3: What would happen if her mother was not sleeping? A: The princess would be caught soon.
Most prior QG research has typically focused on generating factoid-based questions that are relevant to a piece of fact from a single sentence (Zhou et al., 2017; Liu et al., 2019; Zhao et al., 2022). Recently, motivated by building reading comprehension (RC) systems that are competent in understanding and reasoning (Kaushik and Lipton, 2018; Sinha et al., 2019; Chen et al., 2019b), there has been increasing interest in developing systems capable of generating deep questions (Chen et al., 2020; Pan et al., 2020; Fei et al., 2022). However, these works generate diverse questions by relying on different surface-level mentioned information (Cheng et al., 2021; Kai et al., 2021) and consider primarily simple connections between two facts in the context (e.g. bridge and intersection). Less explored are questions involving more facts and the deeper comprehension types between them (Desai et al., 2018), such as analysis of discourse relations (Johnstone, 2017), a thorough evaluation of stated arguments, and deduction of high-level semantics (Gao et al., 2022). As shown in Figure 1, Q1 asks for facts mentioned in the story (e.g. "The princess climbed out the window of the high tower"), whereas Q2 and Q3 ask for a deep connection between the events (a causal relation in Q2 and a future prediction in Q3).
We argue that generating questions with deeper comprehension brings two major benefits: (i) compared with factoid-based QG models, it reflects higher cognitive skills and requires an in-depth understanding of the input text and reasoning over relevant contexts, better imitating how human intelligence embodies the application and integration of skills; (ii) compared with existing deep QG models, it can help build more controllable questions with different comprehension types rather than literal information such as answer types. Based on such questions, we can better identify the downstream performance of QA systems on specific comprehension types and assess their corresponding intrinsic abilities, further allowing us to provide tailored guidance to them and improve training efficiency.
In this paper, we propose SkillQG: a question generation framework with controllable comprehension types. Specifically, we define the comprehension types as five skill dimensions ordered by cognitive complexity: REMEMBER, UNDERSTAND, ANALYZE, CREATE, and EVALUATE, which are inspired by Bloom's Taxonomy (Krathwohl, 2002), an educational schema by which teachers structure a curriculum to ensure that learners possess the necessary abilities before progressing to more complex tasks. Based on this definition, we can differentiate questions by cognitive demand rather than by surface-level information as in previous work, and formulate SkillQG as question generation conditioned on a given comprehension skill.
Furthermore, to improve the specificity of generating questions with a certain comprehension skill, we devise a set of prompts based on the indicative words and question templates of Bloom's Taxonomy. Using these prompts to iteratively elicit chain-of-thought reasoning from a pre-trained language model (PLM), we explicitly generate question focuses (what to ask about) and skill-specific knowledge (how to ask it) to augment the input context.
Finally, to evaluate the SkillQG framework, we introduce evaluation protocols covering question content quality, skill controllability, and downstream QA performance improvement when incorporating the generated questions as additional training data. Our experimental results show that SkillQG produces more relevant and skill-controllable questions than baseline QG models, and boosts QA performance significantly.

Methodology
In this section, we elaborate our SkillQG for generating skill-infused questions. Specifically, we first define the comprehension types of questions as a 5-dimensional skill schema, which draws upon Bloom's Taxonomy (Krathwohl, 2002) from research in cognitive science and describes the cognitive load of topics or samples at different levels. Based on this schema, we categorize questions into different comprehension skills and regard SkillQG as a conditional generator given a skill. Furthermore, to improve the controllability of the skill-infused questions, we adapt the indicative words and templates of Bloom's Taxonomy into a set of prompts to discover question focuses and skill-specific knowledge by prompting a PLM iteratively. Finally, these question focuses and knowledge texts act as auxiliary inputs to steer the question generator.

Formulation of Comprehension Types
Question generation has long served as an essential component for knowledge learning (Tobin, 1990; Lai et al., 2017) and for assessing learning progress (Holme, 2003; Yudkowsky et al., 2019); in particular, asking questions about texts at various comprehension levels deepens the understanding of the text and aids learners in understanding and growing from what they have read (Holme, 2003). Among relevant research in cognitive science and pedagogy, Bloom's Taxonomy (Krathwohl, 2002) is one of the most basic and influential theories. Bloom's Taxonomy is a cognition model used to classify educational learning objectives into levels of complexity and specificity, including knowledge, comprehension, application, analysis, synthesis, and evaluation. Inspired by the hierarchical cognitive objectives of Bloom's Taxonomy, we define the comprehension types of questions as a 5-dimensional skill-based schema in Table 1. We sketch out the meaning of each comprehension skill with some examples as follows.

REMEMBER. The objective of this skill is to promote retention of the presented material in the same form as it exists. Therefore, it requires retrieving relevant content from what a model has read, e.g. recalling the dates of some events in the input passage. Empirically, Sugawara et al. (2018) have shown that some questions can be answered correctly by string-based matching with the given passage alone. In this study, factoid-based questions (Zhou et al., 2017) involving a single fact with explicit mentions, together with definition questions, are categorized into this kind of comprehension skill.

UNDERSTAND. To build a holistic semantic representation of the text from the facts recalled from the passage, the easiest way is to build connections between the "new" knowledge to be gained and prior knowledge. We exemplify four kinds of questions representing this skill: interpreting (e.g. paraphrase important speeches and documents), classifying (e.g. classify observed or described cases of mental disorders), summarizing (e.g. write a brief summary of the events portrayed on a videotape), and comparing (e.g. compare historical events to contemporary situations).

ANALYZE. To step towards a higher comprehension skill, break-down-then-combination is required. This skill aims to break facts into their constituent parts and determine how the parts are related to one another. It usually involves the relationship between two events that are causally related, where the prior event causally leads to the latter event in question. Similarly, Ko et al. (2020) reveal that cause-effect analysis is more challenging in understanding tasks than bridging or comparing known facts, particularly for cases where the passage contains no explicit causal conjunctions and corresponding background knowledge is required. Therefore, we include explanation (e.g.
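To make the schema above concrete, the sketch below encodes the five skill dimensions as a small Python structure together with illustrative question stems distilled from the descriptions in this section; the stems are our own examples, not the exact templates used by SkillQG (those are listed in Appendix A), and the ordering simply follows the listing in the Introduction.

```python
from enum import Enum

class Skill(Enum):
    """Five comprehension skills, in the order listed in the paper."""
    REMEMBER = 1    # retrieve explicitly stated facts
    UNDERSTAND = 2  # interpret, classify, summarize, compare
    ANALYZE = 3     # decompose facts and relate them (e.g. cause-effect)
    CREATE = 4      # hypothesize or predict beyond the text
    EVALUATE = 5    # judge or critique stated arguments

# Illustrative question stems per skill (not the paper's exact templates).
EXAMPLE_STEMS = {
    Skill.REMEMBER:   ["Who ...?", "When did ... happen?", "What is the definition of ...?"],
    Skill.UNDERSTAND: ["Summarize ...", "How would you compare ... with ...?"],
    Skill.ANALYZE:    ["Why did ... happen?", "What caused ...?"],
    Skill.CREATE:     ["What would happen if ...?", "Predict what ... will do next."],
    Skill.EVALUATE:   ["Do you agree that ...? Why?", "How did ... feel about ...?"],
}
```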

SkillQG
Based on the formulation of comprehension types, we follow the common question generation setup (Zhou et al., 2017; Liu et al., 2020) and frame SkillQG as a sequence-to-sequence question generator. Formally, given a context $c$, answer $a$, and comprehension skill $s$, we aim to generate a question $q$ that reflects the corresponding skill by modeling the conditional probability

$$p_\theta(q \mid c, s, a) = \prod_{t=1}^{T} p_\theta(q_t \mid q_{<t}, c, s, a), \quad (1)$$

where $T$ is the length of the generated question comprised of a sequence of tokens $q = \langle q_1, \cdots, q_t, \cdots, q_T \rangle$, and the generator is parameterized by $\theta$. To improve the controllability of generation, we further guide the generator with question-worthy concepts and skill-specific knowledge. Precisely, we leverage chain-of-thought prompting (Wei et al., 2022; Madaan et al., 2022) of a PLM, a prompting paradigm of successively eliciting relevant knowledge from the PLM, to steer the generation of skill-infused questions. Based on it, we can first capture the question focuses and then externalize the implicit knowledge required for mastering the given comprehension skill.

Figure 2: Illustration of the SkillQG pipeline. A skill-infused question (Q) is generated from the following steps: question focus generation, skill-specific knowledge generation, and question generation conditioned on the corresponding context (C), question focus (F), and elicited knowledge (K). The PLM represents an off-the-shelf GPT2 model, while the generator is initialized from a pre-trained BART model and fine-tuned on the training set.
As illustrated in Figure 2, we design several pairs of templates for each comprehension level, i.e. an F-template and a K-template, denoted as $T_F$ and $T_K$ respectively. These template pairs take the form of information-seeking questions (Bruner, 1961), such as "What is the definition of _" and "The definition of _ is _", which help the PLM talk with itself to explicitly discover what it cares about when given a comprehension skill. More specifically, $T_F$ together with the input context is used to construct the prompt input for discovering possible question focuses by template-infilling, while $T_K$ is used to generate skill-related knowledge based on the context and question focuses. Finally, we take the generated knowledge as an auxiliary context and expect that it contributes to improving the generation quality. Denoting the question focus and knowledge text as $f$ and $k$, respectively, the above procedure can be formulated as

$$f = \mathcal{M}(P_F(c)), \quad (2)$$
$$k = \mathcal{M}(P_K(c, f)), \quad (3)$$
$$q \sim p_\theta(q \mid \mathrm{Aug}(c, f, k), s, a),$$

where $\mathcal{M}$ denotes the employed PLM, i.e. GPT2, and $P_F$ and $P_K$ represent the prompt inputs constructed from $T_F$ and $T_K$, respectively. $\mathrm{Aug}(c, f, k)$ means augmenting the original context with the elicited question focus and knowledge text.
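A minimal sketch of this two-step prompting loop is shown below, assuming HuggingFace `transformers` with GPT-2 as the off-the-shelf PLM; the template strings are simplified stand-ins for the F-/K-templates of Appendix A, and greedy decoding is used only for illustration.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy continuation of a prompt; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Simplified REMEMBER-level template pair (stand-ins for the real T_F / T_K).
T_F = "What is the definition of"
T_K = "The definition of {focus} is"

context = ("The princess climbed out the window of the high tower and "
           "climbed down the south wall when her mother was sleeping.")

# Step 1: P_F(c) = context followed by T_F  ->  elicit a question focus f.
focus = complete(f"{context} {T_F}")

# Step 2: P_K(c, f) = context, infilled T_F(f), then T_K  ->  elicit knowledge k.
knowledge = complete(f"{context} {T_F} {focus}? {T_K.format(focus=focus)}")

# Aug(c, f, k): append the infilled templates and knowledge to the context.
augmented_context = f"{context} {T_F} {focus}? {T_K.format(focus=focus)} {knowledge}"
```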
Question focus generation. To improve the controllability of generated questions, we take inspiration from chain-of-thought prompting to capture question focuses and skill-related knowledge. Precisely, considering the close association between a comprehension skill and its involved narrative elements and questioning styles, we devise several pairs of F-templates $T_F$ and K-templates $T_K$ for each skill. An example is shown in Figure 2 and all of the templates are summarized in Appendix A. They are adapted from the indicative words and question templates of Bloom's Taxonomy. After that, the question focus is generated by feeding the context and $T_F$ into the PLM. Following the prompt format of causal language models such as GPT2 (Radford et al., 2019), the prompt input $P_F$ in Eq. 2 for question focus generation is built as

$$P_F(c) = c \oplus T_F,$$

where $\oplus$ denotes text concatenation.

Implicit knowledge generation. We further utilize the K-template $T_K$ to query the PLM for generating skill-related knowledge. This kind of knowledge-externalization method has shown substantial improvements in zero-shot commonsense reasoning (Shwartz et al., 2020). Differing from the heuristic designs of Shwartz et al. (2020) for sample patterns of different datasets, our $T_K$ is based on our hierarchical comprehension skills and collaborates with the question focus to develop a complete chain of thought for the PLM. To be specific, the prompt input $P_K(c, f)$ in Eq. 3 for skill-specific knowledge generation is represented as

$$P_K(c, f) = c \oplus T_F(f) \oplus T_K,$$

where $T_F(f)$ means infilling $T_F$ with the corresponding generated question focus $f$.

Model training. To augment the original input context, we first fill the F-template and K-template with the generated question focus and knowledge text. After that, we append them to the original context to obtain the augmented input:

$$c' = \mathrm{Aug}(c, f, k) = c \oplus T_F(f) \oplus T_K(k).$$

Furthermore, to help our SkillQG learn the relationship between the multiple pieces of input text and capture their functions, we utilize natural language prompts as well as special tokens as delimiters to combine the multiple inputs, i.e. the knowledge-augmented context $c'$, answer text $a$, and skill $s$, into a single sequence. This kind of method has been proven to help models better learn the relationship between multiple pieces of input text and capture their functions, improving performance on various tasks (Schick and Schütze, 2021; Zhou et al., 2022). Formally, the input sequence fed into our question generator is

$$x = \texttt{[CXT]} \oplus c' \oplus \texttt{[ANS]} \oplus a \oplus \texttt{[SKL]} \oplus s,$$

where [CXT], [ANS] and [SKL] are special tokens that mark the boundaries between the multiple input sequences (Radford et al., 2019), and $c'$, $a$ and $s$ are the corresponding context text, answer text, and skill name, respectively. After that, the sequence is fed into a BART-base (Lewis et al., 2020) question generator which models the probability $p_\theta(q \mid c, s, a)$ in Eq. 1 by minimizing the conditional negative log-likelihood (NLL) loss

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{t=1}^{T} \log p_\theta(q_t \mid q_{<t}, c, s, a),$$

where $p_\theta(q_t \mid q_{<t}, c, s, a)$ denotes the predicted probability of the corresponding token in the reference question.
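The following sketch illustrates how the augmented context, answer, and skill name might be packed into a single input sequence and used to fine-tune a BART-base generator with the NLL loss above. The exact natural-language prompt wording is an assumption; only the overall [CXT]/[ANS]/[SKL] layout follows the description in the text.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Register the delimiter tokens described in the text.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[CXT]", "[ANS]", "[SKL]"]})
model.resize_token_embeddings(len(tokenizer))

def build_input(aug_context: str, answer: str, skill: str) -> str:
    # Natural-language prompts plus special tokens as delimiters (wording assumed).
    return (f"[CXT] the context is: {aug_context} "
            f"[ANS] the answer is: {answer} "
            f"[SKL] the skill is: {skill}")

src = build_input(
    aug_context="The princess climbed out the window ... The definition of ...",
    answer="the princess",
    skill="remember",
)
tgt = "Who climbed out of the castle?"

batch = tokenizer(src, return_tensors="pt", truncation=True)
labels = tokenizer(tgt, return_tensors="pt", truncation=True).input_ids

# The forward pass with labels returns the token-level NLL (cross-entropy) loss.
loss = model(**batch, labels=labels).loss
loss.backward()
```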

Experiments
Datasets. We employ the official train and dev splits of the FairytaleQA dataset (Xu et al., 2022) to train our SkillQG. This dataset, focusing on narrative comprehension of English text for both machines and young children, is annotated with seven fine-grained skills: Character, Setting, Action, Feeling, Causal relationship, Outcome resolution, and Prediction. Its annotation process was supervised by three experts in literacy education and its categorization of questions is based on prior educational research (Paris and Paris, 2003), so we can easily map the samples of the FairytaleQA dataset to our defined skill schema. Table 2 presents this mapping and the corresponding breakdown statistics of the dataset.
Baselines. We compare SkillQG to two types of QG baselines. The first type is trained without knowledge input, including NQG++ (Zhou et al., 2017) and QAG (Yao et al., 2022). The other type consists of knowledge-augmented generators, CsQG (Xin et al., 2021) and CQG (Fei et al., 2022), which retrieve external knowledge from knowledge bases or generate knowledge with another model and use it as extra context to generate questions.

Evaluation Protocol
Automatic evaluation metrics. We use standard question generation metrics to evaluate question quality from the following three aspects. The syntactic similarity between generated questions and references is measured by BLEU-4 (Papineni et al., 2002) and ROUGE-L (Lin, 2004). The answerability and structural integrity of generated questions is gauged by Q-BLEU-4 (Nema and Khapra, 2018). The relevance of generated questions to the reference is evaluated by BERTScore (Zhang et al., 2019), while relevance to the given context is evaluated by the factuality dimension of CTC (Deng et al., 2021) and BARTScore (Yuan et al., 2021).

Human evaluation. For question content quality, following the human criteria of QG elaborated by Rus et al. (2010) and Nema and Khapra (2018), we conduct a pairwise comparison where we present a context and two questions produced by two different models and ask the annotators to choose the better of the two or "tie" in terms of grammaticality, answerability, and relevance. We report the percentage of times annotators prefer each model over NQG++ and the percentage of ties, i.e. the wins/ties ratio. For skill controllability, we ask the annotators to read the context, the generated question, and the corresponding answer, choose the evidence sentences in the context, and then annotate the required comprehension skill from our defined 5-dimensional skill schema. Please refer to Appendix C for more details about the annotation.
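As a reference point, the snippet below shows how some of these metrics could be computed with off-the-shelf packages (the HuggingFace `evaluate` library for BLEU/ROUGE and `bert_score` for BERTScore); this is a sketch rather than the authors' evaluation script, and Q-BLEU, CTC, and BARTScore require their authors' released toolkits and are omitted here.

```python
import evaluate
from bert_score import score as bert_score

predictions = ["Why did the princess climb out when her mother was sleeping?"]
references  = ["Why did the princess climb down the wall while her mother slept?"]

# Syntactic similarity: BLEU-4 and ROUGE-L against the reference questions.
bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[[r] for r in references],
                                     max_order=4)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=references)

# Semantic relevance to the reference: BERTScore F1.
_, _, f1 = bert_score(predictions, references, lang="en")

print(f"BLEU-4: {bleu['bleu']:.4f}")
print(f"ROUGE-L: {rouge['rougeL']:.4f}")
print(f"BERTScore F1: {f1.mean().item():.4f}")
```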

Main Results
Table 3 summarizes the quantitative results on the FairytaleQA dataset. On the one hand, compared with the baselines without extra knowledge (i.e. NQG++, QAG, QTD), SkillQG achieves markedly higher scores in terms of answerability and relevance, demonstrating the significant contribution of incorporating extra knowledge and question focuses when generating questions. The comparable results on syntactic similarity metrics may be attributed to these metrics wrongly penalizing the novel generations of our SkillQG. On the other hand, SkillQG consistently outperforms all the knowledge-augmented baselines (i.e. CsQG and CQG) by a considerable margin (gain ratio > 5%), which indicates the effectiveness of the knowledge externalized by our devised prompts.

Inter-annotator agreement. For the two examined aspects of human evaluation, i.e. question content quality and skill controllability, the inter-annotator Krippendorff's α values are 87.20 and 90.73, respectively, which demonstrates an acceptable level of agreement (> 80%) between annotators (Krippendorff, 2004). We then invite the annotators to discuss the few annotation conflicts before the final annotations are determined. Specifically, since skill controllability is calculated as the accuracy between the given skill and the annotated one, the annotators are asked to discuss the discrepancies of annotated skills and reach a unique skill annotation for every sample whenever possible, which is then used as the final annotation.

Question content quality. As shown in Table 4, the pairwise comparisons show that SkillQG produces more grammatical and relevant questions and questions that are mostly answerable (> 50%), compared to all baseline models. Besides, the knowledge-augmented baselines (lower part of Table 4) consistently receive more preference from annotators than the others (upper part of Table 4). This demonstrates that the generated skill-specific knowledge indeed enhances question content and relevance.

Skill controllability. Figure 3 reports the consistency between the given skill name that SkillQG conditions on and the one chosen by the annotators, i.e. skill accuracy. Our SkillQG surpasses the other baselines by a significant margin, and this becomes more obvious for the skills with relatively few samples in the dataset, i.e. around a 30% gain in the CREATE and EVALUATE dimensions. This justifies that SkillQG can not only successfully control the comprehension skill of generated questions, but is also able to learn the underrepresented skills in the dataset, owing to the built prompts containing indicative words of different comprehension skills and the rich skill-specific knowledge of language models. Please refer to Section D for more results.
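For reference, inter-annotator agreement of the kind reported above can be computed with the `krippendorff` package; the sketch below assumes nominal annotations (skill labels encoded as integers, with missing ratings as NaN) and toy data, not the authors' actual annotation matrix.

```python
import numpy as np
import krippendorff

# Rows = annotators, columns = annotated items; values are nominal skill labels
# (0=REMEMBER, 1=UNDERSTAND, 2=ANALYZE, 3=CREATE, 4=EVALUATE), NaN = missing.
reliability_data = np.array([
    [0, 2, 2, 4, 1, np.nan, 3],   # annotator 1
    [0, 2, 2, 4, 1, 0,      3],   # annotator 2
    [0, 2, 1, 4, 1, 0,      3],   # annotator 3
], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.4f}")
```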

Ablation Analysis
We conduct ablation experiments and summarize the results in Table 5 from the following aspects. First, how do the special symbols and prompts of the input representation contribute to the generation quality? The first three baselines combine the multiple input sequences (i.e. context, answer, and skill) with a plain concatenation operation, special symbols, or natural language prompts, denoted as "concat-only (M1)", "symbol-only (M2)" and "prompt-only (M3)", respectively. As shown in Table 5, M1 achieves worse performance than M2 and M3, demonstrating that a simple concatenation operation cannot encode the input sequences well. Besides, both M2 and M3 degrade the performance with respect to SkillQG, showing that integrating special symbols and natural language prompts helps the generator better understand the relationship between multiple input sequences and improves the final quality.
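To make the ablated input representations concrete, the sketch below shows one plausible rendering of each variant; the exact prompt wording is an assumption, and only the distinction between plain concatenation, special-token delimiters, and natural-language prompts follows the ablation setup.

```python
def concat_only(context: str, answer: str, skill: str) -> str:          # M1
    return f"{context} {answer} {skill}"

def symbol_only(context: str, answer: str, skill: str) -> str:          # M2
    return f"[CXT] {context} [ANS] {answer} [SKL] {skill}"

def prompt_only(context: str, answer: str, skill: str) -> str:          # M3
    return (f"the context is: {context} "
            f"the answer is: {answer} "
            f"the skill is: {skill}")

def prompt_and_symbol(context: str, answer: str, skill: str) -> str:    # full SkillQG input
    return (f"[CXT] the context is: {context} "
            f"[ANS] the answer is: {answer} "
            f"[SKL] the skill is: {skill}")
```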
Second, what is the impact of question focus and skill-specific knowledge? The baseline "generator (M4)" does not utilize skill-specific knowledge to augment the context and trains the question generator directly, i.e. a BART model for question generation, while the baseline "conceptnet (M5)" is trained in a similar setting to SkillQG but obtains its extra knowledge by retrieving from ConceptNet rather than querying the PLM. We perform alignment between the context and ConceptNet following the embedding-based matching of Zhou et al. (2022). In Table 5, we find that the contribution of extra knowledge from the PLM (SkillQG vs. M4) is more significant than that from ConceptNet (M5 vs. M4). A possible reason is that chain-of-thought prompting of the PLM yields knowledge that is more relevant and specific to the given context and the required comprehension skill, compared with matching against the limited number of triplets in a knowledge base. This result also agrees with the recent study on evaluating PLMs as knowledge bases (Heinzerling and Inui, 2021).

Boosting QA Performance using Unlabeled Corpus
We further evaluate whether the skill-controllable questions can improve QA performance through data augmentation and help us better understand the QA model's intrinsic abilities. Specifically, we first devise an information extractor to obtain ⟨passage, skill, answer⟩ combinations from an unlabeled corpus, i.e. passages without annotations of question, answer, and skill. After that, we feed the extracted ⟨passage, answer, skill⟩ combinations into SkillQG to generate skill-infused questions. Finally, we add the generated questions to the FairytaleQA training set and train a QA model on the augmented dataset to further evaluate the effectiveness of our SkillQG.

Information extraction. Since the answer and the required skill depend on each other, we cannot sample the ⟨passage, answer, skill⟩ combinations randomly. Following widely adopted solutions (Liu et al., 2020; Ghanem et al., 2022), we decompose the process into two steps that sequentially sample the required skills and the corresponding answers to select reasonable combinations. Formally, the sampling procedure can be written as

$$p(s, a \mid c) = p(s \mid c)\, p(a \mid c, s),$$

where $p(s \mid c)$ and $p(a \mid c, s)$ are devised as a model-based and a rule-based extractor, respectively. On the one hand, $p(s \mid c)$ is formulated as a multi-label classification task because a passage may involve more than one skill. We first fine-tune a DistilBERT model (Sanh et al., 2019) on the FairytaleQA dataset to learn skill-related patterns in the context. After that, we use it to predict the candidate skills for an unlabeled passage.
On the other hand, we borrow the statistical analysis of the FairytaleQA dataset from Yao et al. (2022) and implement $p(a \mid c, s)$ using heuristic rules. Specifically, the REMEMBER and EVALUATE skills, i.e. the narrative elements consisting of character, setting, and feeling, are usually based on named entities, such as a mentioned name or a particular place. Therefore, we resort to the spaCy toolkit (Honnibal and Montani, 2017) to extract named entities as the candidate answers. The other skills, i.e. the narrative elements consisting of action, causal relationship, outcome resolution, and prediction, are mainly made up of action events. Thus, we first leverage Propbank's semantic role labeler (Johansson and Nugues, 2008) to extract the trigger verb as well as the involved subject and object, and then concatenate them into a complete sentence as the candidate answer.
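A minimal sketch of the two extractors is given below: $p(s \mid c)$ as a DistilBERT multi-label classifier (the checkpoint path and threshold are placeholders) and $p(a \mid c, s)$ approximated with spaCy; the SRL step for event-centred skills is only sketched with a dependency-based subject-verb-object heuristic, since the Propbank-style labeler used in the paper is a separate toolkit.

```python
import torch
import spacy
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SKILLS = ["remember", "understand", "analyze", "create", "evaluate"]
ENTITY_SKILLS = {"remember", "evaluate"}   # character, setting, feeling
# Remaining skills cover action, causal relationship, outcome resolution, prediction.

# Hypothetical checkpoint: DistilBERT fine-tuned on FairytaleQA as a
# multi-label skill classifier, i.e. the model-based extractor p(s | c).
CKPT = "path/to/distilbert-skill-classifier"
clf_tokenizer = AutoTokenizer.from_pretrained(CKPT)
clf_model = AutoModelForSequenceClassification.from_pretrained(
    CKPT, num_labels=len(SKILLS), problem_type="multi_label_classification")

nlp = spacy.load("en_core_web_sm")

def predict_skills(passage: str, threshold: float = 0.5) -> list[str]:
    """p(s | c): return every skill whose sigmoid probability exceeds the threshold."""
    inputs = clf_tokenizer(passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(clf_model(**inputs).logits.squeeze(0))
    return [s for s, p in zip(SKILLS, probs.tolist()) if p > threshold]

def candidate_answers(passage: str, skill: str) -> list[str]:
    """p(a | c, s): rule-based answer extraction conditioned on the skill."""
    doc = nlp(passage)
    if skill in ENTITY_SKILLS:
        # Named entities (people, places, dates, ...) as candidate answers.
        return [ent.text for ent in doc.ents]
    # Approximate the SRL step with a subject-verb-object heuristic.
    candidates = []
    for token in doc:
        if token.pos_ == "VERB":
            subj = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c.text for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            if subj and obj:
                candidates.append(f"{subj[0]} {token.lemma_} {obj[0]}")
    return candidates

# Example usage: assemble <passage, answer, skill> combinations from one passage.
passage = "The princess climbed out the window when her mother was sleeping."
combinations = [(passage, a, s)
                for s in predict_skills(passage)
                for a in candidate_answers(passage, s)]
```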
We conduct the sampling procedure on the passages of the FairytaleQA training set with all their annotations discarded, then feed the extracted ⟨passage, answer, skill⟩ combinations into SkillQG, keeping all beam search outputs (beam size = 8) for each sample. Consequently, we can generate diverse questions for the existing passages in the FairytaleQA training set. Finally, we randomly select 80,000 candidate questions and augment the FairytaleQA training set with them. As a comparison, following the same setting as above, we design a baseline that uses CQG, one of the most competitive baselines in Table 3, as the question generator.
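Keeping every beam of the generator, as described above, can be done with the standard `generate` API; the sketch below assumes a fine-tuned SkillQG checkpoint (the path is a placeholder) and the input format from Section 2.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# Hypothetical path to the fine-tuned SkillQG generator.
CKPT = "path/to/skillqg-bart-base"
tokenizer = BartTokenizer.from_pretrained(CKPT)
model = BartForConditionalGeneration.from_pretrained(CKPT)

def generate_questions(src: str, beam_size: int = 8) -> list[str]:
    """Return all beam-search hypotheses for one <passage, answer, skill> input."""
    inputs = tokenizer(src, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=beam_size,
        num_return_sequences=beam_size,  # keep every beam, as described above
        max_length=64,
        early_stopping=True,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```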
We train a state-of-the-art QA baseline (Xu et al., 2022) on the augmented dataset to further evaluate the quality of the generated questions. Following Xu et al. (2022), we report QA performance as the ROUGE-L F1 score, a commonly used metric for generative question answering. The results in a high-resource setting (with the whole FairytaleQA training set) and a low-resource setting (with only 25% of the data sampled from the original FairytaleQA training set) are illustrated in Figure 4a. We observe that the questions generated by SkillQG improve QA performance to a greater extent than those generated by CQG under both settings. In particular, the QA model under the low-resource setting achieves performance comparable to the high-resource setting when leveraging the 100% additional samples generated by our SkillQG.
Furthermore, Figure 4b breaks down the performance of SkillQG under the low-resource setting (i.e. the "25% FairytaleQA + SkillQG" setting shown in Figure 4a) along the defined skill dimensions. This result shows that the questions generated by SkillQG significantly boost all of the comprehension capabilities of the QA model. Among them, the cognitively challenging skills that the QA model struggles with, such as EVALUATE and CREATE, even achieve the largest improvements. This demonstrates that the skill-controllable questions generated by SkillQG can compensate for the limited number of training samples in the FairytaleQA dataset and are favorable for a fine-grained assessment of the comprehension capabilities of QA models.

Related Work
Deep question generation. Previous QG systems mainly generated factoid-based questions with a sequence-to-sequence model (Zhou et al., 2017; Liu et al., 2019), a PLM (Liu et al., 2020), or a graph-based architecture (Talmor and Berant, 2018; Kumar et al., 2019). Recently emerged QG models aim at generating questions that require deep reasoning. On the one hand, Cheng et al. (2021) proposed to generate difficulty-controllable questions through step-by-step rewriting, while Bi et al. (2021) decoded multi-hop questions via a soft template. On the other hand, Yao et al. (2022) and Zhao et al. (2022) devised educational question generators to facilitate the assessment of children's literacy. Our SkillQG is inspired by their fine-grained analysis but driven by the motivation that generating questions with deep comprehension is beneficial to QA training. More recently, Cao and Wang (2021) charted a new question ontology, but they focused on constructing diversified open-ended questions from specified question types.

Knowledge-augmented generation. Although explicit knowledge generation has been explored in natural language understanding (Liu et al., 2022; Wei et al., 2022), similar research on natural language generation (Zhou et al., 2022), especially for QG, is relatively rare (Rajani et al., 2019). Xin et al. (2021) retrieved knowledge triplets from ConceptNet (Speer et al., 2017) to enhance QG models, while Fei et al. (2022) adopted a graph attention network (GAT) (Veličković et al., 2018) to capture focuses for question generators. We take the lessons of these works and extend the knowledge source to pre-trained language models.

Conclusion
Existing QG systems focus on the literal nature of questions and rarely consider the comprehension types of the generated questions. To better assess and improve machine reading comprehension models, we propose SkillQG to generate questions with controllable comprehension types. In addition, we leverage question focuses and skill-specific knowledge to improve the controllability of generation. Empirical results show that SkillQG outperforms baselines while achieving a significant performance boost in downstream QA training.

Limitations
Our work proposes a new QG framework, namely SkillQG, to frame the comprehension skill required by a question and generate the corresponding comprehension-oriented questions. The limitations are three-fold. Firstly, we propose a new skill-based schema for the comprehension nature of questions, map the existing annotations of narrative elements in the FairytaleQA dataset onto it, and conduct our experiments. This kind of mapping might not reflect the required skills accurately, since a narrative element can cover more than one comprehension type. Furthermore, although our proposed skill-based schema is drawn from general text comprehension, SkillQG is only verified on the FairytaleQA dataset and lacks an analysis of generalizability. However, identifying skills and their correlations with comprehension skills on new datasets can be challenging, because SkillQG may struggle with input passages that have a relatively simple discourse structure, which usually do not contain complicated relations. One remedy to this issue could be collecting a new QA dataset with annotations following our proposed schema. We regard this as future work and deem designing a new annotation specification a promising direction.
Besides, although we boost the downstream QA performance in Section 3.4 by augmenting the original training set with generated questions, the final performance (56.9%) is still far behind the human performance (64.4%) reported by Xu et al. (2022). However, the breakdown analysis of QA performance demonstrates that SkillQG can strengthen all of the comprehension capabilities, especially the challenging ones. As a result, generating questions that are matched to the current comprehension capabilities of the QA model, and co-evolving the QA system with the corresponding QG system, could be two interesting research topics.
Last but not least, our SkillQG is built on PLMs of general domains, ignoring domain-specific and multilingual applications. The backbone PLMs have also been shown to encode biased representations, such as those of race and gender (Gonen and Goldberg, 2019). Therefore, additional evaluation protocols are left for our future work.

C Annotation Details
Our human evaluation is conducted by a total of five annotators. All of the annotators are from China, between 25 and 30 years old, competent in English, and studying as Computer Science graduate students. They are informed of the necessary background knowledge on QG and its evaluation, as well as detailed annotation instructions along with examples when participating in our study. In addition, they gladly volunteered to provide their assistance without being compensated in any form. The candidate questions are anonymized and evaluated on the following aspects:

• Question content quality. Following the human criteria elaborated in QG-STEC Task B (Rus et al., 2010), we check whether a question is well-formed, answerable, and relevant to the context. Besides, previous work has shown that pairwise comparison produces a more reliable evaluation than directly asking humans to score the candidates (Amidei et al., 2019; Celikyilmaz et al., 2020). Therefore, we present a context and two questions produced by two different models and ask the annotators to choose the better of the two or "tie". Specifically, we first show the annotators a candidate question generated by NQG++ and another generated by a different model, together with the corresponding input context and answer text.
After that, we ask the annotators to compare the two questions in terms of grammaticality, answerability, and relevance. To better guide the annotators to distinguish between high-quality and low-quality candidate questions, we also show them clear scoring examples, as presented in Table 7.
• Skill controllability. This checks the consistency between the given skill that the question generator is conditioned on and the one chosen by the annotators, i.e. skill accuracy. This kind of fine-grained annotation is inspired by the recent study on educational question generation (Ghanem et al., 2022) and is used to evaluate the controllability of generation.
Before the annotation, we show the annotators the template samples for each comprehension skill as summarized in Table 1. During annotation, they follow an annotation instruction comprising three steps:
(1) Make a statement using the reference question and the gold-standard answer. (2) Extract the sentences from the context required to support the statement.
(3) Re-read our defined skill-based schema in Table 1 and choose the single skill required to establish an entailment from the extracted context to the statement.
• Knowledge quality. Since evaluating the overall quality of knowledge is challenging (Heinzerling and Inui, 2021; West et al., 2022), this aspect checks the groundedness and relevance of our generated knowledge text to the given context. Specifically, we first show the annotators the input context, candidate question, answer text, and corresponding generated knowledge text. After that, we ask the annotators to answer two questions ("Does the generated knowledge make sense?" and "Is the generated knowledge relevant to the input context?"). Only our SkillQG and the knowledge-augmented baselines are involved in this aspect of evaluation, and the annotation option is either yes or no. We report the percentage of yes answers for the two questions described in Section 3.1.
As shown in Figure 5, we develop a web application to present and collect the human evaluation results automatically. This software sends the candidate samples to the annotators, guides them to evaluate the samples along the aforementioned three dimensions, and finally posts the annotation results to our server. These results are based on the original collection of the dataset and do not violate the rights of individuals or groups. Based on the results, we report the human evaluation results in Section 3.2 and Section D.

D More Experimental Results
We also analyze the quality of the generated knowledge to better understand its contribution to the final performance. The human evaluation results on knowledge quality are summarized in Table 8, and the inter-annotator Krippendorff's α is 88.42, indicating an acceptable level of consistency (> 80%) between annotators (Krippendorff, 2004). The few annotation conflicts were resolved after a discussion among the annotators. The table shows that SkillQG can generate implicit knowledge that makes sense and is pertinent to the context around 85% of the time, as evaluated by human annotators. Compared with the other knowledge-augmented baselines that retrieve knowledge from ConceptNet, SkillQG generates knowledge that is similar in terms of common sense and has better relevance to the input context. The possible reason behind this is that SkillQG generates knowledge by asking and answering information-seeking questions based on the given context, benefiting the specialization of the general knowledge of language models to each sample.

Figure 3: Human evaluation results on skill controllability, which is computed by comparing the given skill with the annotated skill. We depict the accuracy for each skill along the horizontal axis.

Figure 4: Overall and decomposed performance of the state-of-the-art QA model on the FairytaleQA dataset, augmented with data generated by question generators.

Figure 5: A screenshot of our human annotation process.

Qualitative example of a generated question (Q) with its question focuses (F) and elicited knowledge (K):
Q: Why the project under construction will raise Las Vegas' supply of rooms by 20%?
F: What is the relationship between Las Vegas and Clark? K: The relationship between Las Vegas and Clark is that Las Vegas is situated within Clark County, in a basin on the floor of the Mojave Desert.
F: What is the requirements for a project? K: The requirement for a project is the

Table 2 :
Breakdown statistics of the FairytaleQA dataset and its mapping to our proposed skill-based schema.

Table 3 :
Quantitative results in terms of answerability, syntactic similarity, and relevance evaluation metrics on the FairytaleQA dataset. Please refer to Section 3.1 for the full names of the employed metrics. The best result is marked in bold.

Table 5 :
Quantitative results of ablation experiments.The best result is marked as bold.

Table 7 :
Scoring examples for the human evaluation on the question content quality.The problematic words in corresponding candidate questions are marked in red.

Table 8 :
Human evaluation results on knowledge quality.