Controllable Open-ended Question Generation with A New Question Type Ontology

We investigate the less-explored task of generating open-ended questions that are typically answered by multiple sentences. We first define a new question type ontology which differentiates the nuanced nature of questions better than widely used question words. A new dataset with 4,959 questions is labeled based on the new ontology. We then propose a novel question type-aware question generation framework, augmented by a semantic graph representation, to jointly predict question focuses and produce the question. Based on this framework, we further use both exemplars and automatically generated templates to improve controllability and diversity. Experiments on two newly collected large-scale datasets show that our model improves question quality over competitive comparisons based on automatic metrics. Human judges also rate our model outputs highly in answerability, coverage of scope, and overall quality. Finally, our model variants with templates can produce questions with enhanced controllability and diversity.


Introduction
Question-asking has long served as an effective instrument for knowledge learning (Andre, 1979;Tobin, 1990) and assessing learning progress (Holme, 2003;Downing and Yudkowsky, 2009;Livingston, 2009). Compared to the widely studied task of generating factoid questions that inquire about "one bit" of information (Du et al., 2017;Duan et al., 2017;Li et al., 2019), this work is interested in generating open-ended questions that require deep comprehension and long-form answers (Labutov et al., 2015). Such open-ended questions are valuable in education, e.g., to facilitate complex knowledge acquisition (Lai et al., 2017) and nurture reasoning skills (Shapley, 2000), as well as in other applications like improving search engines (Han et al., 2019) and building open-domain dialogue systems (Shum et al., 2018).

Input: It's a difficult task to undertake. Teenagers tend to identify gangs with "fitting" in. Peer pressure plays a large part in it and sometimes teenagers have problems with their own identity being part of a gang deals with those issues. It also provides a little bit of respect on the street ...

BART SAMPLING:
- How do you stop a teen from joining a gang? (PROCEDURAL)
- How do you get teenagers to stop being in gangs? (PROCEDURAL)
- How do you get teens out of gangs? (PROCEDURAL)

BART + QWORD:
- How do you get a teenager out of a gang? (PROCEDURAL)
- What is the best way to get teenagers out of gangs? (PROCEDURAL)
- Why do teenagers join gangs? (CAUSE)

TPLGEN:
- How do I get [NP] to quit being in [NP]? ⇒ How do I get my son to quit being in a gang? (PROCEDURAL)
- What are [NP]? ⇒ What are some programs for teenagers involved in gangs? (EXAMPLE)
- Why do [NP] [V] [NP]? ⇒ Why do teenagers identify gangs? (CAUSE)

Figure 1: Open-ended questions generated by different models after reading the same input: (1) BART decoded with nucleus sampling, (2) BART that considers different question words, and (3) our type-aware generator TPLGEN, which predicts focuses and operates with generated templates (to the left of the arrows). Questions generated by our model have diverse TYPEs.
Significant progress has been made in generating factoid questions (Zhang and Bansal, 2019;Zhou et al., 2019b;Su et al., 2020), yet new challenges need to be addressed for open-ended questions. First, specifying the question type is crucial for constructing meaningful questions (Graesser et al., 1992). Question words such as "why" and "when" are generally seen as indicative of types (Zhou et al., 2019b), but they underspecify the conceptual content of questions (Olney et al., 2012). Using Figure 1 as an example, different question words, i.e., both "how" and "what", can be used for inquiring about procedures. This calls for a new question type ontology that can precisely capture the conceptual nature of questions. Second, constructing questions from a text with multiple sentences needs to focus on its central concepts or phenomena that necessitate extensive descriptions. New representations are needed to capture such content as question focus(es), going beyond existing methods that rely on entities and their neighboring words (Du et al., 2017;Sun et al., 2018), which are effective for generating factoid questions but insufficient here. Third, encouraging the diversity of generated questions (Sultan et al., 2020;Wang et al., 2020) is less explored but critical for real-world applications, e.g., various questions should be proposed to gauge how well students grasp the knowledge of complex subjects.
In this work, we aim to address the challenges of generating open-ended questions from input consisting of multiple sentences. We first introduce a new question type ontology, drawing upon research in cognitive science and psychology (Graesser et al., 1992), to capture deeper levels of cognition, such as causal reasoning and judgments. Based on the new ontology, we collect and annotate a dataset of 4,959 questions to benefit research in both question generation and answering. We then design a type-aware framework to jointly predict question focuses (what to ask about) and generate questions (how to ask). Different from pipeline-based approaches (e.g., Sun et al. (2018)), our framework is built on the large pre-trained BART (Lewis et al., 2020) and uses shared representations to jointly conduct question focus prediction and question generation while learning task-specific knowledge. It is further augmented by a semantic graph that leverages both semantic roles and dependency relations, facilitating long text comprehension to pinpoint salient concepts.
Moreover, to achieve the goal of producing various types of questions from the same input, we investigate two model variants that use templates to improve controllability and generation diversity: one using pre-identified exemplars, the other employing generated templates to guide question writing, with sample outputs displayed in Figure 1.
For experiments, we collect two new large-scale datasets consisting of open-ended questions with answers from (1) the Yahoo Answers L6 dataset and (2) popular question-asking communities on Reddit, consisting of 291K and 720K question-answer pairs, respectively. Compared to existing popular QA datasets, such as SQuAD (Rajpurkar et al., 2016) and MS MARCO (Bajaj et al., 2016), questions in our datasets ask about complex phenomena and perplexing social issues that call for long-form answers. Automatic metrics show that our type-aware question generation model outperforms competitive comparisons, highlighting the effectiveness of the semantic graph-augmented representation and the joint modeling of focus prediction and question generation. Human judges also confirm that questions generated by our model have better overall quality. Adding templates further promotes question diversity, as shown by both automatic evaluation and human assessment.

Related Work
Question generation has long been studied to reduce human effort in constructing questions for knowledge learning evaluation (Mitkov and Ha, 2003;Brown et al., 2005). Early work relies on syntactic transformations to convert declarative sentences into questions (Heilman and Smith, 2010;Chali and Hasan, 2015). Recent advances rely on sequence-to-sequence models to generate a question from a given sentence or paragraph by considering the focus, type, and general-specific relations of questions (Sun et al., 2018;Zhou et al., 2019b;Krishna and Iyyer, 2019). In particular, question likelihoods and rewards are designed to steer generated questions toward being addressable by the given answers (Zhou et al., 2019a;Zhang and Bansal, 2019). Attempts are also made toward creating complex questions that require multi-hop reasoning over the given text, where graph-based representations have been an enabling tool for accessing both entities and relations (Pan et al., 2020;Su et al., 2020). While our model also enhances the input with a semantic graph, it boasts a richer representation by including both dependency and semantic relations, with predicted question focuses highlighted via extra node embeddings. Moreover, we create a separate layer of cross attentions dedicated to the semantic graph, while prior work uses the same set of attentions to attend to the concatenated text and graph representations.
Given the data-driven nature of question generation and answering tasks, recent studies take advantage of the availability of large-scale QA datasets, such as SQuAD (Rajpurkar et al., 2016), MS MARCO (Bajaj et al., 2016), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), inter alia. These corpora mainly contain factoid questions, while our newly collected datasets are not only larger in size but also comprise significantly more open-ended questions querying reasons and procedures. A dataset closer to ours is ELI5 (Fan et al., 2019), which also obtains open-ended question-answer pairs from Reddit, while one of our datasets includes more Reddit communities and thus covers a wider range of topics.
Our work is more in line with generating deeper questions whose responses span multiple sentences, where manually constructed templates are found effective (Olney et al., 2012). For example, Labutov et al. (2015) use crowdsourcing to collect question templates based on an ontology derived from Wikipedia and Freebase topics. Different from the topic-based ontology, our question types are more aligned with cognitive levels. Moreover, our templates are automatically learned from training data. Recent work (Rao and Daumé III, 2018, 2019) focuses on asking clarification questions based on both retrieval and generation models. As there has been no suitable framework for generating diverse types of questions, this work aims to fill the gap by introducing type-aware generation models that optionally leverage question templates for better controllability.
Generating diverse questions is much less studied, with existing approaches mainly focusing on entity replacement (Cho et al., 2019), sampling-based decoding (Sultan et al., 2020;Wang et al., 2020), and post-filtering (Liu et al., 2020). However, the produced diversity is driven by word choice and syntax variation, with little ability to control question types, which is the focus of this work.

Open-ended Question Datasets
To collect open-ended questions, we resort to online forums with active question-asking discussions. Concretely, we gather and clean question-answer pairs from Reddit and Yahoo Answers to train generators that construct questions by taking the corresponding answer as input.

From Table 1: PROCEDURAL: the procedures, tools, or methods by which a certain outcome is achieved. JUDGMENTAL: the answerer's own opinions.

We choose five popular Reddit communities: r/AskHistorians, r/Ask_Politics, r/askscience, r/explainlikeimfive, and r/AskReddit, where open-ended questions are actively asked. The original posts (OPs) are extracted, with their titles becoming questions. We also keep the best answer, i.e., the one with the highest karma (upvotes minus downvotes), if its karma is greater than 1. A second dataset of question-answer pairs is collected from the Yahoo Answers L6 corpus, which covers a broader range of topics than the Reddit data. For each question, we keep the best answer, as rated by the user who asked the question.
Preprocessing. To ensure both questions and answers are well-formed, human inspection is conducted in multiple iterations to design rules that filter out improper samples. For instance, we discard samples whose answers have fewer than 15 content words to avoid including factoid questions. More details are provided in Table 6 in Appendix A. Ultimately, 719,988 question-answer pairs are kept for Reddit, and 290,611 for Yahoo. Each dataset is then divided into training, validation, and test sets with a 90%/5%/5% split. The average lengths of questions and answers are 14.5 and 117.8 for Reddit, and 12.2 and 123.6 for Yahoo.

Question Type Ontology and Annotation
Our question type ontology is adapted from Olney et al. (2012), where 18 categories are originally proposed for knowledge learning assessment. We recruited 6 native English speakers for three rounds of question type annotation. Based on the annotators' feedback after each round, we refine the definitions, merge ambiguous types, and delete inapplicable categories. For example, an initial EXPECTATION type is merged into CAUSE due to their similarities in seeking causality. Finally, 10 types are preserved (Table 1). As can be seen, our ontology is designed to capture the nature of questions more precisely than question words do.
Annotating Questions with Types. After the annotation guideline is finalized, we ask the same set of annotators to label 5,000 (2 × 2,500) randomly sampled questions from both Reddit and Yahoo's training sets. Each question is labeled by two annotators, with disagreements resolved through discussions. After removing samples without consensus, the final dataset consists of 4,959 questions. EXAMPLE questions are the most prevalent, comprising 23.4% of samples, while only 2.6% are CONSEQUENCE questions. A Krippendorff's α of 0.67 is obtained over all samples, indicating a reasonable agreement level. The annotation guideline and examples for each question type are shown in Table 12 in Appendix A.
Training Question Type Classifiers. Since our type-aware question generation model requires a specified type as input, we describe here how we build two question type classifiers: (1) γ_q, which labels a type by reading the question and is used to provide question type labels during training; and (2) γ_a, which predicts a type from the answer and is used during testing.
Both classifiers are based on RoBERTa (Liu et al., 2019), where a prediction layer is built on top of the contextual representation of the [BOS] token to output question type probabilities. γ q achieves a macro F1 score of 0.80 on a reserved test set, with data splits detailed in Appendix B. To train γ a , in addition to the annotated questions, we run γ q on unlabeled questions in Reddit and Yahoo and include samples whose type prediction confidence score is above 0.9. We train one γ a for each dataset. γ a obtains macro F1 scores of 0.48 and 0.46 on the same reserved test set over all types after training on Yahoo and Reddit, respectively.
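The confidence-thresholded self-training step for γ_a, i.e., keeping only silver labels that γ_q predicts with confidence above 0.9, can be sketched as follows. This is a rough illustration, not the released pipeline; `toy_classify` is a hypothetical stand-in for running the trained γ_q.

```python
def select_confident(samples, classify, threshold=0.9):
    """Keep (answer, type) training pairs whose silver label from the
    question-based classifier is predicted with high confidence."""
    kept = []
    for question, answer in samples:
        label, confidence = classify(question)
        if confidence > threshold:
            kept.append((answer, label))
    return kept

def toy_classify(question):
    # Hypothetical stand-in for gamma_q; returns (type, confidence).
    if question.lower().startswith("why"):
        return ("CAUSE", 0.95)
    return ("EXAMPLE", 0.6)

silver = select_confident(
    [("Why do teenagers join gangs?", "Peer pressure plays a part."),
     ("What are some programs?", "There are a few.")],
    toy_classify)
# Only the high-confidence CAUSE sample is kept.
```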
After running γ_q on both datasets, we find that Reddit has significantly more EXAMPLE questions (43.8% of all samples). The Yahoo dataset is more balanced, with PROCEDURAL questions being the most frequent type (19.9% of all samples). Distributions of question types for the two datasets are listed in Table 8 in Appendix B.

Figure 2: Detecting question focuses (nodes in darker color) and generating questions (or templates) are jointly learned. We only show a partial semantic graph. Special tokens are also inserted to segment different parts of the input. JOINTGEN uses the type and the answer for question generation; EXPLGEN further considers an exemplar; and TPLGEN uses all three for template generation.

Type-aware Open-ended Question Generation
In this section, we present our type-aware question generation framework. As shown in Figure 2, our model takes in a multi-sentence text and a predicted question type. Built on shared input representations, it first detects question focuses from a semantic graph, and then generates the question ( § 4.1).
We also propose two model variants that consider automatically extracted template exemplars or generated templates to achieve controllability ( § 4.2), enabling the generation of diverse questions.

Joint Focus Prediction and Question Generation (JOINTGEN)
Our generator is built on top of BART (Lewis et al., 2020). To facilitate the detection of salient content (i.e., focuses) to raise questions about, we first augment the encoder with a semantic graph that consists of both dependency relations and semantic roles, capturing semantic relations over different scopes and at varying granularities. Question focuses are first detected based on the semantic graph and then guide question generation via cross attentions, as shown in Figure 2. Although the joint modeling of focus prediction and question generation has been studied before, our design differs by using shared representations of the input text and semantic graph, with the predicted focuses incorporated through a gating mechanism, whereas previous work, e.g., Pan et al. (2020), simply employs multi-task learning. Below, we first describe the semantic graph-augmented encoder, followed by the joint modeling of the two tasks.

Improving Long Text Comprehension with Semantic Graph. To construct the semantic graph, we start by obtaining each sentence's dependency tree using Stanford CoreNLP (Manning et al., 2014). To better highlight core concepts, we discard less important relations, e.g., auxiliaries.
The full list is included in Appendix C. Since our goal is to detect central concepts that are well connected to many other words, we remove the relation labels on edges to minimize the number of parameters to learn. Moreover, as semantic roles can indicate main entities (Mannem et al., 2010), we extract semantic roles and their relations with AllenNLP (Shi and Lin, 2019). To merge the two sources of information, we add an edge in the dependency tree connecting the head word of the predicate and the head word of each semantic role. To build a connected graph from the multi-sentence input, we add an edge between each sentence's last token and the next sentence's first token. Finally, we merge nodes with the same surface forms or with coreferent mentions. To the best of our knowledge, this is the first time that both dependency and semantic relations are encoded in the same graph for question generation, and with the enhanced connectivity of the constructed graph, our design can better signal content salience.

Joint Modeling with Cross-attentions. Given a predicted question type t and a multi-sentence text x = {x_1, ..., x_n}, the BART encoder builds the contextual representation H = {h_0, h_1, ..., h_n} at the last layer, where h_0 corresponds to t.
To encode the semantic graph, we initialize the representation of node v_i by averaging the contextual representations of its tokens and appending four bits encoding the number of nodes (capped at 10) merged into v_i, to add frequency information. This step yields new node representations v_i^{(0)}. We then apply graph attention networks (GATs) (Veličković et al., 2018) with L layers to update the representations:

v_i^{(l+1)} = σ( Σ_{j ∈ N_i} a_{i,j} W^{(l)} v_j^{(l)} )

where W^{(l)} is a learnable parameter for the l-th layer, and N_i denotes the neighbors of v_i. The attention score a_{i,j} is calculated as in GATs. We use L = 2 in experiments.
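A single-head GAT-style update of this kind can be sketched in plain Python. This is a minimal illustration with fixed (not learned) parameters, assuming every node's neighbor list is non-empty (e.g., includes itself); it is not the trained model.

```python
import math

def gat_layer(node_feats, neighbors, W, a):
    """One single-head GAT-style update (Velickovic et al., 2018):
    e_ij = LeakyReLU(a . [W v_i ; W v_j]) scores are softmax-normalised
    over each node's neighbors, then used to average transformed
    neighbor features.  W is a square weight matrix, a an attention
    vector over the concatenated pair of transformed features."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    transformed = [matvec(W, v) for v in node_feats]
    updated = []
    for i, nbrs in enumerate(neighbors):       # nbrs must be non-empty
        scores = []
        for j in nbrs:
            cat = transformed[i] + transformed[j]
            e = sum(ak * ck for ak, ck in zip(a, cat))
            scores.append(e if e > 0 else 0.2 * e)   # LeakyReLU
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha = [x / z for x in exps]                # softmax
        out = [0.0] * len(transformed[0])
        for w, j in zip(alpha, nbrs):
            for k in range(len(out)):
                out[k] += w * transformed[j][k]
        updated.append(out)
    return updated

# Tiny demo: identity weights and a zero attention vector give a
# uniform average over each node's neighbors.
out = gat_layer([[1.0, 0.0], [0.0, 1.0]],
                [[0, 1], [1]],
                [[1.0, 0.0], [0.0, 1.0]],
                [0.0, 0.0, 0.0, 0.0])
# -> [[0.5, 0.5], [0.0, 1.0]]
```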
To predict focuses, the final node representation v_i^{(L)} is fed into a feedforward network to yield the probability of v_i being a focus:

p_i = σ( W_2 tanh( W_1 v_i^{(L)} ) )

where W_1 and W_2 are learnable parameters and σ is the sigmoid function. Bias terms are omitted for simplicity. We construct ground-truth labels by treating a node as a focus if it contains words used in the question.
To generate the question, we use a gating mechanism to incorporate the focus prediction results: each node representation is weighted by its predicted focus probability p_i,

v̂_i = p_i · v_i^{(L)}

Our model benefits from both large-scale pre-training and the hybrid semantic graph by adding, in each BART decoder layer, separate cross attentions that attend to (1) the output of the BART encoder, yielding z_e, and (2) the gated node representations V̂^{(L)}, producing z_v:

z_e = LN( Attn(z_s, H) + z_s )
z_v = LN( Attn(z_e, V̂^{(L)}) + z_e )
z = LN( FFN(z_v) + z_v )

where z_s denotes the output of the self attention in the current layer and z is the layer's output. Attn(·, ·) denotes multi-head attention as in Vaswani et al. (2017), FFN(·) a feedforward layer, and LN(·) layer normalization. Our final training objective combines the focus prediction and question generation losses with equal weights.
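The gating step, i.e., scaling each node representation by its focus probability before the decoder attends to it, is simple enough to sketch directly (an illustration on toy vectors, not the model code):

```python
def gate_nodes(node_reprs, focus_probs):
    """Scale each final node representation by its predicted focus
    probability, so likely focuses dominate the decoder's
    cross-attention over the semantic graph."""
    return [[p * x for x in v]
            for v, p in zip(node_reprs, focus_probs)]

# A node with focus probability 0.5 is down-weighted; one with 0.9
# passes through nearly unchanged.
gated = gate_nodes([[2.0, 4.0], [1.0, 1.0]], [0.5, 0.9])
# -> [[1.0, 2.0], [0.9, 0.9]]
```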

Diversifying Questions with Templates (EXPLGEN & TPLGEN)
An important goal of this work is to enable the generation of questions of diverse types. However, simply adding question type as input is insufficient (discussed in § 5). We thus propose to leverage question templates to gain stronger controllability. Below we first present how to automatically extract templates from the training set, and then introduce two model variants that are built on the JOINTGEN framework: EXPLGEN uses exemplar templates to guide the model to generate questions of selected types, and TPLGEN adds an extra step to first generate type-specific templates.
Template Extraction. While collecting templates specific to a given type, we need to ensure they remain topic-independent so as to generalize to different domains. To this end, we replace a word in the question with a template token indicating its syntactic function, e.g., [V] for a verb, if it appears in the answer after lemmatization. We further consider topically related words in the questions by calculating word-level semantic similarities based on Numberbatch word embeddings (Speer et al., 2017), which are found to perform better on our datasets than other embeddings. Concretely, for each word in the answer, we replace the most similar word in the question with the template token. This process is repeated until 80% of the content words in the question are replaced. Finally, for each noun phrase, adjective phrase, and adverb phrase, if its head word has been replaced, the whole phrase is transformed into a phrase type token. For instance, the question "What are the differences between global warming and climate change?" becomes "What are the differences between [NP] and [NP]?"

Exemplars for Guidance (EXPLGEN). Our first model variant adds a template exemplar for the given type as additional input, which provides more specific information to control the type of generated questions. Figure 2 shows one such example. To identify exemplars, we use templates with frequencies above 20 on Yahoo and 50 on Reddit. We then manually inspect these templates and remove the ones with topic-specific words, resulting in 66 exemplars across all types. They are listed in Table 10 in Appendix D.
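The core of the template extraction procedure above can be sketched as follows. This is a deliberate simplification: it only handles exact (lowercased) matches between question and answer tokens and maps a few POS tags to template tokens, omitting lemmatization, embedding similarity, and the phrase-level [NP] lifting described in the text.

```python
def extract_template(question_tokens, question_pos, answer_tokens):
    """Simplified template extraction: replace any question token that
    also appears in the answer with a token naming its syntactic
    function ([N], [V], [ADJ])."""
    answer_vocab = {t.lower() for t in answer_tokens}
    pos_token = {"NOUN": "[N]", "VERB": "[V]", "ADJ": "[ADJ]"}
    out = []
    for tok, pos in zip(question_tokens, question_pos):
        if tok.lower() in answer_vocab and pos in pos_token:
            out.append(pos_token[pos])
        else:
            out.append(tok)
    return " ".join(out)

tpl = extract_template(
    ["Why", "do", "teenagers", "join", "gangs", "?"],
    ["ADV", "AUX", "NOUN", "VERB", "NOUN", "PUNCT"],
    ["Teenagers", "tend", "to", "join", "gangs", "sometimes"])
# -> "Why do [N] [V] [N] ?"
```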
During training, we choose the exemplar with the lowest edit distance to the question, which is also used for training an exemplar selector based on RoBERTa. During testing, the exemplar with the highest selector score is used. The accuracy of the exemplar selector for each question type on the test set is reported in Table 11 in Appendix D.
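The training-time exemplar choice, i.e., picking the template closest to the gold question by edit distance, can be sketched as below. The word-level distance is an assumption on our part; the paper does not specify the edit-distance granularity.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def closest_exemplar(question, exemplars):
    """Pick the exemplar template with the lowest word-level edit
    distance to the gold question (used to build training pairs)."""
    return min(exemplars,
               key=lambda e: edit_distance(question.split(), e.split()))

best = closest_exemplar(
    "Why do teenagers join gangs ?",
    ["Why do [NP] [V] [NP] ?", "What are [NP] ?", "How do I [V] [NP] ?"])
# -> "Why do [NP] [V] [NP] ?"
```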

Generated Templates for Guidance (TPLGEN).
We further propose another model variant where we generate a new template and feed it (instead of an exemplar template as in EXPLGEN) as part of the question generation input. Specifically, we reuse EXPLGEN to learn to generate a target template, as derived from the template extraction procedure. During question realization, TPLGEN uses a BART-based generator that takes as input the question type, the input text, the generated template, and the words that are predicted as focuses. We use separate cross attentions to attend to the representations of the focused words, similar to how node representations are attended to in JOINTGEN.
We recognize that having separate stages of exemplar selection and template generation introduces extra model training cost and potential errors in the pipeline. This work, however, focuses on improving the controllability and diversity of question generation, and we leave building more efficient models to future work.

Automatic Evaluation
Comparisons and Metrics. We compare with DEEPQG (Pan et al., 2020), a model that uses dependency graphs for multi-hop question generation. We also compare with BART models that are finetuned on the same datasets as in our models, by using inputs of (1) the answer (BART), (2) the answer and a predicted question word (BART+QWORD), and (3) the answer and a predicted question type (BART+QTYPE). For BART+QWORD, the question word is predicted by a RoBERTa classifier that considers the answer and is trained on our training sets. We follow Liu et al. (2020) and use 9 categories of question words. For both our models and BART+QTYPE, the most confident type predicted by the classifier γ a (described in § 3.2), which reads in the answer, is used as input. To test the efficacy of semantic graphs, we further compare with a variant of JOINTGEN that only uses the flat Transformer for focus prediction and question generation, denoted as JOINTGEN w/o graph.
We evaluate the generated questions with BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and ROUGE-L (Lin, 2004). Results on both the Yahoo and Reddit datasets are reported in Table 2. Our JOINTGEN outperforms all comparisons on both datasets over all automatic evaluation metrics except for METEOR on Reddit. When the semantic graphs are removed, model performance degrades substantially, which suggests that a structured representation is useful for focus detection and the final question generation task. We also observe a large performance gap between DEEPQG and systems based on BART, signifying the importance of leveraging pre-trained models for open-ended question generation. Meanwhile, adding question types helps BART generate more relevant questions than using question words, indicating the value of our new question type ontology. Notably, our template-based generators, EXPLGEN and TPLGEN, which are trained to comply with the given templates, still produce comparable scores. This highlights the possibility of controlling the generated questions' types and syntax through templates, without performance loss.

Table 2: Automatic evaluation results on Yahoo and Reddit with BLEU-4 (B-4), METEOR (MTR), and ROUGE-L (R-L). JOINTGEN outperforms comparisons over all metrics except for METEOR on Reddit, but removing its graph degrades performance. We only report results by DEEPQG on Yahoo due to memory limitations. *: significantly better than BART (p < 0.005 with approximate randomization test).

Table 3: Automatic evaluation on controllability and diversity by specifying 9 different question types. We report type accuracy (Acc), number of unique types (UnT), and pairwise BLEU-4 (Pair). Our EXPLGEN and TPLGEN achieve stronger controllability by respecting the given question types more, and show higher diversity than comparisons except for BART with nucleus sampling. *: significantly better than all comparisons (p < 0.005).
Question Diversity Evaluation. Next, we examine the controllability of models by specifying different question types as input. The top 9 most confident types predicted by our type predictor γ_a are used as input to our models, producing 9 questions for evaluation. For BART, we use nucleus sampling (Holtzman et al., 2020) with k = 10 and p = 0.7 to sample diverse questions.
To evaluate, we first calculate question type accuracy by checking whether the types of the generated questions match the specified ones, with types labeled by our classifier γ_q (§ 3.2). We then report the average number of unique question types among the 9 generated questions per sample, with a higher number indicating better controllability. Finally, we compute pairwise BLEU-4 (Cho et al., 2019) between pairs of generated questions per sample, where lower values suggest higher content diversity.
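The two diversity metrics can be sketched as follows. The pairwise score here is a simplified single-n n-gram precision in the spirit of pairwise BLEU-4, not the full smoothed BLEU-4 with brevity penalty used in the paper.

```python
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pairwise_overlap(questions, n=2):
    """Mean pairwise n-gram precision between generated questions:
    a simplified stand-in for pairwise BLEU-4 (lower = more diverse)."""
    scores = []
    for a, b in combinations(questions, 2):
        ga, gb = ngrams(a.split(), n), set(ngrams(b.split(), n))
        if ga:
            scores.append(sum(g in gb for g in ga) / len(ga))
    return sum(scores) / len(scores) if scores else 0.0

def unique_types(predicted_types):
    """Number of distinct question types realised per sample."""
    return len(set(predicted_types))

qs = ["how do you stop a teen",
      "how do you stop teens",
      "why do teens join gangs"]
score = pairwise_overlap(qs)
```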
First, our EXPLGEN and TPLGEN can generate questions with diverse types and content, as shown by the significantly higher numbers of unique types than all comparisons and lower pairwise BLEU scores than comparisons except for BART with nucleus sampling in Table 3. This implies stronger type control by template-based generators, compared to BART+QTYPE and JOINTGEN which only use the question type token as input. Results on numbers of unique types by varying numbers of question types specified in the input are displayed in Figure 3, where EXPLGEN and TPLGEN maintain steady controllability.
Second, our question type ontology provides a new perspective for question diversity evaluation. Among the comparisons, although BART with nucleus sampling and BART+QWORD both have low pairwise BLEU, the types of questions they can generate are limited.

Human Evaluation
Question Diversity. We hire three annotators who participated in our question type annotation study to evaluate 80 groups of questions generated by four selected models on each dataset. For each group, we randomly sample an answer and specify the three most probable question types to each model, to generate three corresponding questions. For each sample, the annotators are asked to rank the four models from 1 (highest) to 4 (lowest) on three aspects of diversity: type (whether the three generated questions have different types), syntax (whether they use different syntax), and answer content (whether the three questions need to be addressed with different answers). Ties are allowed.
We find that human judges rate questions generated by our EXPLGEN and TPLGEN as having greater diversity over all aspects, except for syntax diversity on Reddit, as shown in Table 4. Between the two model variants, questions by TPLGEN yield more diverse answers. Based on our observation, TPLGEN uses automatically generated templates to produce more focused questions with different answers, compared to EXPLGEN, which employs exemplars. This shows the promise of using automatically generated templates to create questions that need to be addressed with different answers. Besides Figure 1, we show more sample outputs in Figure 4, where EXPLGEN and TPLGEN exhibit stronger controllability than JOINTGEN.

[Figure 4 excerpt. Answer: My sister in law and her husband "genetically modified" their second child because the first has EB. They eliminated that and had a baby that gets to live pain free. Under the right circumstances, I'm all for it ... Caption: Sample outputs of our models given different question types. Spans that belong to the exemplars or the generated templates are colored in blue. Generated questions that do not match the given type are marked with strikethrough.]

Question Content Quality. We use the same set of human judges to evaluate another 80 groups of questions output by five selected models and the reference. Three aspects are rated from 1 (worst) to 5 (best): appropriateness (whether the question is semantically correct, without considering the answer); answerability (whether the question can be addressed by the given answer); and scope (whether the question relates to a longer span of the answer, i.e., global scope, or focuses on local content, e.g., one phrase or one sentence). We further ask the annotators to rank questions based on their overall quality and preferences, with ties allowed.
As shown in Table 5, our JOINTGEN model produces questions with better answerability that also cover broader content in the answers. It is also rated the best in more than half of the evaluation instances on both datasets. Between BART+QWORD and BART+QTYPE, human judges rate the outputs conditioned on our question types as having better overall quality.

Further Analyses
Does focus prediction correlate with question quality? We first investigate the relationship between focus prediction and question generation using our joint model JOINTGEN. As can be seen from Figure 5, there is a strong correlation between F1 scores of focus prediction and BLEU-4 scores of the generated questions. We also show the F1 scores and BLEU-4 for selected question types on the right of Figure 5, again demonstrating the effect of focus detection on question quality.
When do our models fail to respect the given types? Next, we provide insights into which types of questions are challenging to generate using our template-based models EXPLGEN and TPLGEN. Both variants frequently fail to respect the given question type of VERIFICATION, in which cases they often produce JUDGMENTAL questions instead. They also tend to confuse EXAMPLE and EXTENT with CONCEPT questions. After manually inspecting 50 generated questions for the aforementioned three types, we find that many of them can be labeled with more than one type, thus creating confusion for our classifier. For instance, "What are the import restrictions in the US?" can be considered as either asking for a definition or for examples. Therefore, future work should include designing multi-class type identification models.

Conclusion
We present a new question type ontology that better captures the nuances of questions to support the study of open-ended question generation. We further annotate a new dataset of 4,959 questions based on the proposed ontology. We describe a joint question focus detection and question generation framework with a novel semantic graph-augmented representation, built directly on large pre-trained models. Based on this framework, we also enhance the controllability and diversity of generated questions by employing template exemplars or automatically generated templates. Experiments on two large datasets show that questions generated by our models have better quality and higher diversity than those from non-trivial comparisons, with similar results from human judges.

Ethics Statement
Large models pre-trained on heterogeneous web data have been shown to encode biases and can be potentially harmful to marginalized populations. While the automatically learned templates improve controllability in question generation, we also recognize that our system could be misused to create questions containing objectionable content. We therefore advocate cautious and responsible practices in real-world deployment.
Our data collection process for the two new datasets involves removing samples with abusive language and human inspection of random samples. Given the data volume, however, we cannot exhaustively verify that all records are free of potentially offensive content.

A Data Collection
Data Filtering. After collecting the raw data from Yahoo and Reddit, we design rules to filter out ill-formed answers and questions. These rules are listed in Table 6. Finally, we conduct human inspection of random samples from the two datasets and confirm that the retained samples are clean and contain open-ended questions.

Rules for Data Cleaning
-The question contains URL links.
-The question has more than 1 sentence or does not end with a question mark.
-The question has fewer than 4 words or no content words.
-The question does not start with wh-words: what, why, how, which, where, who, when; yes-no words: is, are, was, were, will, would, do, does, did, can, could, should, has, have; or frequent words for conditions: if, in, for, to, as, at.
-The answer has fewer than 15 content words.
-The answer has fewer content words than the question.
-The answer has more than 30% of its words as digits.
-The question and the answer have fewer than 2 overlapping content words.
-The question or the answer contains abusive words from Google's "what do you need" project 7 .
-The question or the answer has emoticons 8 .
-The question or the answer has 3 consecutive punctuation marks.
-The question or the answer has 3 consecutive fully uppercased words.
-The question has more than 90% of title-case words, or the answer has more than 30% of title-case words.
-The question has more than 1 unique word not in the English dictionary, or the answer has more than 2 unique words not in the English dictionary 9 .

For samples with disagreed labels, we check whether agreement can be reached by considering both labeled types. For example, if annotator A labels VERIFICATION and JUDGMENTAL, and annotator B labels JUDGMENTAL, the agreed-upon type is JUDGMENTAL. We then resolve outstanding disagreements by discussion.
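A few of the surface rules above can be sketched as simple predicates. This is an illustrative simplification: the content-word and dictionary checks would require a POS tagger and an English word list in a real pipeline, which are omitted here.

```python
import re
import string

def is_ill_formed_question(q):
    """Apply a subset of the question-side filters (simplified).
    Returns True if the question should be discarded."""
    words = q.split()
    if re.search(r"https?://", q):            # contains a URL link
        return True
    if not q.rstrip().endswith("?"):          # must end with a question mark
        return True
    if len(words) < 4:                        # fewer than 4 words
        return True
    punct = re.escape(string.punctuation)
    if re.search("[{p}]{{3}}".format(p=punct), q):
        return True                           # 3 consecutive punctuation marks
    return False

def is_ill_formed_answer(a):
    """Apply a subset of the answer-side filters (word counts simplified
    to whitespace tokens rather than content words)."""
    words = a.split()
    if len(words) < 15:                       # too short
        return True
    digits = sum(w.isdigit() for w in words)
    if words and digits / len(words) > 0.3:   # more than 30% digit tokens
        return True
    return False
```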

B Details for Question Type Classifiers
To train the question type classifier γ q that reads the question as input, we split the collected question type dataset into training, validation, and test sets. Sample counts and question type distributions for the different data splits are shown in Table 7. We then use γ q to identify types for unlabeled questions in Yahoo and Reddit. The question type distributions for the two datasets are shown in Table 8.
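The data split can be sketched as a stratified partition that preserves the question type distribution in each set. The 80/10/10 ratios below are illustrative assumptions, not necessarily the proportions used in the paper.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Partition (question, type) pairs into train/valid/test so that each
    split roughly preserves the overall question type distribution."""
    by_type = defaultdict(list)
    for q, t in samples:
        by_type[t].append((q, t))
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for t, group in by_type.items():
        rng.shuffle(group)
        n = len(group)
        n_train = int(n * ratios[0])
        n_valid = int(n * ratios[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]
    return train, valid, test
```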

D Details for Templates and Exemplars
Template Construction. To avoid replacing words that are representative of question types during template construction, we maintain a list of words not to be replaced for each question type, as shown in Table 9. These words are identified by frequency with additional manual inspection.
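Template construction can be sketched as masking content words that are not on the type's keep-list. The keep-list below is a hypothetical stand-in for Table 9, and a real implementation would identify content words with a POS tagger rather than a hard-coded function-word set.

```python
# Hypothetical keep-list standing in for Table 9 (CAUSE-type example).
KEEP = {"CAUSE": {"why", "cause", "reason", "makes"}}
# Function words are never replaced; a small illustrative set only.
FUNCTION_WORDS = {"what", "is", "the", "a", "an", "of", "so", "to", "do"}

def make_template(question, qtype):
    """Replace content words that are neither function words nor on the
    question type's keep-list with a [MASK] placeholder."""
    out = []
    for tok in question.lower().rstrip("?").split():
        if tok in FUNCTION_WORDS or tok in KEEP.get(qtype, set()):
            out.append(tok)
        else:
            out.append("[MASK]")
    return " ".join(out) + "?"

make_template("Why is the sky blue?", "CAUSE")  # → "why is the [MASK] [MASK]?"
```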

E Details for Implementation
We use Fairseq to build our models and to conduct training and decoding. For the Graph Attention Networks (GATs) in our focus predictor, we adopt the implementation from PyTorch Geometric (Fey and Lenssen, 2019). All our experiments are conducted on a Quadro RTX 8000 GPU with 48 GB of memory.
Training Settings. We use Adam (Kingma and Ba, 2014) to train all our models. The question type classifiers and template exemplar classifiers are trained with a maximum learning rate of 1 × 10 −5 and a batch size of 32. For the generation models, the maximum learning rate is 3 × 10 −5 and each batch contains at most 32,768 tokens for all models except those with GATs.
Decoding Settings. We use beam search for decoding. A beam size of 5 and a length penalty of 1.5 are used for all models. Repeated trigram blocking is applied to question generation. The minimum and maximum lengths for generation are set to 1 and 100, respectively.
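The repeated trigram blocking mentioned above can be sketched independently of the decoder: at each step, any candidate token that would recreate a trigram already present in the hypothesis is excluded from the beam. This is an illustrative pure-Python check, not Fairseq's implementation.

```python
def blocked_tokens(hypothesis, vocabulary):
    """Return the candidate next tokens that repeated-trigram blocking
    would forbid, given the tokens generated so far."""
    if len(hypothesis) < 2:
        return set()
    # All trigrams seen so far in the hypothesis.
    seen = {tuple(hypothesis[i:i + 3]) for i in range(len(hypothesis) - 2)}
    # The last two tokens form the prefix of any new trigram.
    prefix = tuple(hypothesis[-2:])
    return {tok for tok in vocabulary if prefix + (tok,) in seen}

hyp = ["what", "is", "the", "point", "of", "what", "is"]
blocked_tokens(hyp, {"the", "a", "point"})  # → {"the"}
```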
Model Parameters. The question type classifiers and template exemplar classifiers are based on RoBERTa Large , which has 355M parameters. Our generation model builds a GAT upon the BART model, containing 430M parameters in total.
Running Time. Training question type classifiers takes 23 hours. Due to the difference in training data size, the training time for template exemplar classifiers ranges from 20 minutes to 3 hours. For our generation model with focus prediction, it takes 6 hours to train on Yahoo and 12 hours to train on Reddit. Decoding on the test set of Yahoo and Reddit takes 8 minutes and 15 minutes, respectively.
F Annotation Guidelines
In this study, you are asked to annotate the question types for 1,000 questions. The question type reflects the nature of the question; it is not determined by the interrogative word of the question. There are 10 question types in total. The definition for each type is shown below, along with examples per question type. During annotation, you can label the two most confident types when no clear decision can be made for a single most probable type.

VERIFICATION: Asking for the truthfulness of an event or a concept.
-"Is Michael Jackson an African American?"
-"Does a Mercedes dealer have to unlock a locked radio?"
-"Could stress, anxiety, or worry cause cholesterol levels to rise?"

DISJUNCTIVE: Asking for the true one given multiple events or concepts, where comparison among options is not needed.
-"Is Michael Jackson an African American or Latino?"
-"Is a DVI to HDMI cable supposed to transmit audio and video or just video?"
-"When you get a spray-on tan does someone put it on you or does a machine do it?"

CONCEPT: Asking for a definition of an event or a concept.
-"Who said the sun never sets on the British empire?"
-"Where do dolphins have hair at?"
-"What is the origin of the phrase "kicking the bucket"?"

EXTENT: Asking for the extent or quantity of an event or a concept.
-"How long does gum stay in your system?"
-"What is Barry Larkin's hat size?"
-"To what extent is the Renewable Fuel Standard accurate nationwide?"

EXAMPLE: Asking for example(s) or instance(s) of an event or a concept.
-"What are some examples to support or contradict this?"
-"Where can I get my teeth examined around Los Angeles?"
-"What countries/regions throughout the world do not celebrate the Christmas holidays?"
-"What is the best goal or win you have ever made in a sport?"

COMPARISON: Asking for comparison among multiple events or concepts.
-"How does an electric violin "play" differently than an acoustic violin?"
-"What is the best tinted facial moisturizer?"
-"In what hilariously inaccurate ways is your job/career portrayed on television or in movies?"
-"Which is better, Nike or Adidas?"

CAUSE: Asking for the cause or reason for an event or a concept.
-"How does the D.M.V. decide the first letter of the California driver's license?"
-"Why are parents strick on girls than boys?"
-"What makes nerve agents like "Novichok" so hard to produce and why can only a handful of laboratories create them?"
-"Why is the sky blue?"

CONSEQUENCE: Asking for the consequences or results of an event.
-"What are the negative consequences for the services if they do not evaluate their programs?"
-"In the US, what is the benefit of having a red left-turn arrow?"
-"What would happen if employers violate the legislation?"
-"What if the Hokey Pokey is really what it's all about?"

PROCEDURAL: Asking for the procedures, tools, or methods by which a certain outcome is achieved.
-"Why YM 7.5 BETA always stupidly shows me available, although I initially set it to invisible?"
-"How did the Amish resist assimilation into the current social status in the U.S?"
-"How astronomers detect a nebula when there are no stars illuminating it?"

JUDGMENTAL: Asking for the answerer's own opinions.
-"Do you think that it's acceptable to call off work for a dying-dead pet?"
-"Should I date a guy that has an identical twin?"
-"How old is too old for a guy to still live with his mother?"