Improving Question Generation with Multi-level Content Planning

This paper addresses the problem of generating questions from a given context and an answer, specifically focusing on questions that require multi-hop reasoning across an extended context. Previous studies have suggested that key phrase selection is essential for question generation (QG), yet it is still challenging to connect such disjointed phrases into meaningful questions, particularly for long contexts. To mitigate this issue, we propose MultiFactor, a novel QG framework based on multi-level content planning. Specifically, MultiFactor includes two components: FA-model, which simultaneously selects key phrases and generates full answers, and Q-model, which takes the generated full answer as an additional input to generate questions. Here, full answer generation is introduced to connect the short answer with the selected key phrases, thus forming an answer-aware summary to facilitate QG. Both FA-model and Q-model are formalized as simple-yet-effective Phrase-Enhanced Transformers, our joint model for phrase selection and text generation. Experimental results show that our method outperforms strong baselines on two popular QG datasets. Our code is available at https://github.com/zeaver/MultiFactor.


Introduction
Question Generation (QG) is a crucial task in the field of Natural Language Processing (NLP) that focuses on creating human-like questions from a given source context and a specific answer. In recent years, QG has gained considerable attention from both academic and industrial communities due to its potential applications in question answering (Duan et al., 2017), machine reading comprehension (Du et al., 2017), and automatic conversation (Pan et al., 2019; Ling et al., 2020).
Effective content planning is essential for QG systems to enhance the quality of the output questions. This task is particularly important for generating complex questions that require reasoning over long contexts. Based on the content granularity, prior research (Zhang et al., 2021) can be broadly categorized into two groups: phrase-level and sentence-level content planning. On one hand, the majority of prior work (Sun et al., 2018; Liu et al., 2019; Pan et al., 2020; Cao and Wang, 2021; Fei et al., 2022; Subramanian et al., 2018) has focused on phrase-level planning, where the system identifies key phrases in the context and generates questions based on them. For instance, given the answer "28 January 1864" and a two-paragraph context in Figure 1, we can recognize "English Inventor," "the oil engine," and "Herbert Akroyd Stuart" as important text for generating questions. For long contexts, however, it is still challenging for machines to connect such disjointed facts to form meaningful questions. On the other hand, sentence-level content planning, as demonstrated by Du and Cardie (2017), aims at automatic sentence selection to reduce the context length. For instance, given the sample in Figure 1, one can choose the underscored sentences to facilitate QG. Unfortunately, it is observable that the selected sentences still contain redundant information that may negatively impact question generation. Therefore, we believe that effective automatic content planning at both the phrase and the sentence levels is crucial for generating questions.
In this paper, we investigate a novel framework, MultiFactor, based on multi-level content planning for QG. At the fine-grained level, answer-aware phrases are selected as the focus for downstream QG. At the coarse-grained level, a full answer generation model is trained to connect such (disjointed) phrases and form a complete sentence. Intuitively, a full answer can be regarded as an answer-aware summary of the context, from which complex questions are more conveniently generated. As shown in Figure 1, MultiFactor is able to connect the short answer with the selected phrases, and thus create a question that requires more hops of reasoning compared to Vanilla QG. It is also notable that we follow a generative approach instead of a selection approach (Du and Cardie, 2017) to sentence-level content planning. Figure 1 demonstrates that our generated full answer contains more focused information than the selected (underscored) sentences.
Specifically, MultiFactor includes two components: 1) an FA-model, which simultaneously selects key phrases and generates full answers; and 2) a Q-model, which takes the generated full answer as an additional input for QG. To realize these components, we propose the Phrase-Enhanced Transformer (PET), where phrase selection is treated as a joint task with the generation task in both FA-model and Q-model. Here, the phrase selection model and the generation model share the Transformer encoder, enabling better representation learning for both tasks. The selected phrase probabilities are then used to bias the Transformer decoder to focus more on the answer-aware phrases. In general, PET is simple yet effective, as we can leverage the power of pretrained language models for both the phrase selection and the generation tasks.
Our main contributions are summarized as follows:
• To our knowledge, we are the first to introduce the concept of full answers in an attempt at multi-level content planning for QG. As such, our study helps shed light on the influence of answer-aware summaries on QG.
• We design our MultiFactor framework following a simple yet effective pipeline of Phrase-Enhanced Transformers (PET), which jointly model the phrase selection task and the text generation task. Leveraging the power of pretrained language models, PET achieves high effectiveness while keeping the number of additional parameters fairly low in comparison to the base model.
• Experimental results validate the effectiveness of MultiFactor on two settings of HotpotQA, a popular benchmark on multi-hop QG, and on SQuAD 1.1, a dataset with shorter contexts.

Related Work
Early Question Generation (QG) systems (Mostow and Chen, 2009; Chali and Hasan, 2012; Heilman, 2011) followed a rule-based approach. This approach, however, suffers from a number of issues, such as poor generalization and high maintenance costs. With the introduction of large QA datasets such as SQuAD (Rajpurkar et al., 2016) and HotpotQA (Yang et al., 2018), the neural-based approach has become the mainstream in recent years.
In general, these methods formalize QG as a sequence-to-sequence problem (Du et al., 2017), on which a number of innovations have been made from the following perspectives.
Enhanced Input Representation Recent question generation (QG) systems have used auxiliary information to improve the representation of the input sequence. For example, Du et al. (2017) used paragraph embeddings to enhance the input sentence embedding. Du and Cardie (2018) further improved input sentence encoding by incorporating co-reference chain information from preceding sentences. Other studies (Su et al., 2020; Pan et al., 2020; Fei et al., 2021; Sachan et al., 2020a) enhanced input encoding by incorporating semantic relationships, which are obtained by extracting a semantic or entity graph from the corresponding passage and then applying graph attention networks (GATs) (Veličković et al., 2018).
One of the challenges in QG is that the model might generate answer-irrelevant questions, such as producing inappropriate question words for a given answer. To overcome this issue, different strategies have been proposed to effectively exploit answer information for input representation. For example, Zhou et al. (2017), Zhao et al. (2018), and Liu et al. (2019) marked the answer location in the input passage. Meanwhile, Song et al. (2018) and Chen et al. (2020) exploited complex passage-answer interaction strategies. Kim et al. (2019) and Sun et al. (2018), on the other hand, sought to avoid answer-included questions by using separate encoders for answers and passages. Compared to these works, we also aim to make better use of answer information, but we do so from the new perspective of full answers.

Content Planning
The purpose of content planning is to identify essential information from the context. Content planning is widely used in text generation tasks such as QA/QG, dialogue systems (Fu et al., 2022; Zhang et al., 2023; Gou et al., 2023), and summarization (Chen et al., 2022). Previous studies (Sun et al., 2018; Liu et al., 2019) predicted "clue" words based on their proximity to the answer. This approach works well for simple QG from short contexts. For more complex questions that require reasoning over multiple sentences, researchers selected entire sentences from the input (documents, paragraphs) as the focus for QG, as in the study conducted by Du and Cardie (2017). Nevertheless, coarse-grained content planning at the sentence level may include irrelevant information. Therefore, recent studies (Pan et al., 2020; Fei et al., 2021, 2022) have focused on obtaining finer-grained information at the phrase level for question generation. In these studies, semantic graphs are first constructed through dependency parsing or information extraction tools. Then, a node classification module is leveraged to choose essential nodes (phrases) for question generation.
Our study focuses on content planning for Question Generation (QG) but differs from previous studies in several ways. Firstly, we target automatic content planning at both the fine-grained level of phrases and the coarse-grained level of sentences.
As far as we know, we are the first to consider multiple levels of granularity for automatic content planning. Secondly, we propose a novel Phrase-Enhanced Transformer (PET), which is a simple yet effective approach to phrase-level content planning. Compared to graph-based methods, PET is relatively simpler as it eliminates the need for semantic graph construction. In addition, PET is able to leverage the power of pre-trained language models for its effectiveness. Thirdly, we perform content planning at the sentence level by following a generative approach instead of the extraction approach presented in the study by Du and Cardie (2017). The example in Figure 1 shows that our generated full answer contains less redundant information than selecting entire sentences of supporting facts.
Diversity While the majority of previous studies focus on generating context-relevant questions, recent studies (Cho et al., 2019; Wang et al., 2020b; Fan et al., 2018; Narayan et al., 2022) have sought to improve the diversity of QG. Although we do not yet consider the diversity issue, our framework provides a convenient way to improve diversity while maintaining consistency. For example, one can perform diverse phrase selection or look for diverse ways to turn full answers into questions. At the same time, different strategies can be used to ensure that the full answer is faithful to the given context, thus improving consistency.

MultiFactor Question Generation
Given a source context c = [w_1, w_2, ..., w_{T_c}] and an answer a = [a_1, a_2, ..., a_{T_a}], the objective is to generate a relevant question q = [q_1, q_2, ..., q_{T_q}], where T_c, T_a, and T_q denote the number of tokens in c, a and q, respectively. It is presumed that we can generate full answers s = [s_1, s_2, ..., s_{T_s}] of T_s tokens, thus obtaining answer-relevant summaries of the context. The full answers are subsequently used for generating questions as follows:

s = FA-model(c ⊕ a),    q = Q-model(c ⊕ a ⊕ s)

where Q-model and FA-model refer to the question generation and the full answer generation models, respectively. Each of Q-model and FA-model is formalized as a Phrase-Enhanced Transformer (PET), our proposal for text generation with phrase planning. In the following, we denote a PET as ϕ : x → y, where x is the input sequence and y is the output sequence. For the FA-model, the input sequence is x = c ⊕ a and the output is the full answer s, where ⊕ indicates string concatenation.
As for the Q-model, the input is x = c ⊕ a ⊕ s, with s being the best full answer from the FA-model, and the output is the question q. The PET model ϕ first selects phrases that can consistently be used to generate the output, then integrates the phrase probabilities as soft constraints for the decoder during generation. The overview of MultiFactor is demonstrated in Figure 2. The Phrase-Enhanced Transformer is detailed in the following section.
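The two-stage pipeline can be sketched as follows. Here `fa_model` and `q_model` are hypothetical callables standing in for the trained PET generators, and `SEP` is a placeholder for the concatenation operator ⊕; neither name comes from the paper's released code.

```python
# Minimal sketch of the MultiFactor pipeline, assuming `fa_model` and
# `q_model` are callables wrapping trained seq2seq generators
# (hypothetical stand-ins). `SEP` is a placeholder separator token.

SEP = " <sep> "

def multifactor_generate(context: str, answer: str, fa_model, q_model) -> str:
    # Stage 1: FA-model maps x = c (+) a to the best full answer s.
    full_answer = fa_model(context + SEP + answer)
    # Stage 2: Q-model maps x = c (+) a (+) s to the question q.
    return q_model(context + SEP + answer + SEP + full_answer)
```

The only coupling between the two stages is the generated full answer string, which is why errors in the FA-model can propagate to the Q-model (analyzed in Section 4.5).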

Phrase Enhanced Transformer
We propose the Phrase-Enhanced Transformer (PET), a simple yet effective Transformer-based model that infuses phrase selection probabilities from the encoder into the decoder to improve question generation.
Formally, given the input sequence x and L phrase candidates, the i-th phrase is p_i = {x_{l^i_1}, x_{l^i_2}, ..., x_{l^i_{|p_i|}}}, extracted from the context x, where l^i_j indicates the index of the j-th token of the i-th phrase in x. Phrase-level content planning is formalized as assigning a label z_i ∈ {0, 1} to each phrase in the candidate pool, where z_i is 1 if the phrase should be selected and 0 otherwise. The phrase information is then integrated to generate y auto-regressively:

P(y | x) = ∏_t P(y_t | y_{<t}, x, z_1, ..., z_L)

Encoder and Phrase Selection Recall that the input x contains the context c and the answer a in both Q-model and FA-model; we thus select the candidate phrases only from the context c by extracting entities, verbs and noun phrases using SpaCy (https://spacy.io/). Phrase selection is formalized as a binary classification task, where the input is a phrase encoding obtained from the Transformer encoder:

H = Encoder(x),    h^z_i = [MaxPooling({H_j}); MeanPooling({H_j})]

where H ∈ R^{T_x × d}, with T_x and d being the length of the input sequence and the dimension of the hidden states, respectively. Here, Encoder indicates the Transformer encoder, of which the details can be found in (Devlin et al., 2019). The phrase representation h^z_i is obtained by concatenating MaxPooling(·) and MeanPooling(·) of the hidden states {H_j} corresponding to the i-th phrase. We then employ a linear network with Softmax(·) as the phrase selection probability estimator (Galke and Scherp, 2022).
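The phrase-selection head can be sketched with NumPy as below. Shapes are tiny and all weights are random placeholders, so this illustrates only the pool-and-classify mechanics, not trained behavior.

```python
import numpy as np

# Sketch of PET's phrase selection head: the i-th phrase representation
# concatenates max- and mean-pooling of its token hidden states, and a
# linear layer with softmax estimates P(z_i = 1). Weights are random
# placeholders, not trained parameters.

rng = np.random.default_rng(0)
d = 8                                  # hidden size (tiny for illustration)
H = rng.normal(size=(10, d))           # encoder states, H in R^{T_x x d}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def phrase_select_prob(H, token_idx, W, b):
    states = H[token_idx]                        # states of the phrase tokens
    h = np.concatenate([states.max(axis=0),      # MaxPooling
                        states.mean(axis=0)])    # MeanPooling -> 2d vector
    return softmax(h @ W + b)[1]                 # probability of label z_i = 1

W = rng.normal(size=(2 * d, 2))        # classifier weights (2d x 2)
b = np.zeros(2)
prob = phrase_select_prob(H, [2, 3, 4], W, b)    # phrase spanning tokens 2..4
```

The (2d × 2) classifier here matches the parameter count given in the appendix: with d = 768 it adds only 1536 × 2 weights on top of the base model.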
Probabilistic Fusion in Decoder The decoder consumes the previously generated tokens y_{1...t−1} and then generates the next token as follows:

P(y_t | y_{1...t−1}, x) = DecLayers(y_{1...t−1}, H)

where H is the Encoder output, and DecLayers indicates a stack of N decoder layers. As in the Transformer, each PET decoder layer contains three sublayers: 1) the masked multi-head attention layer; 2) the multi-head cross-attention layer; and 3) the fully connected feed-forward network. Since the multi-head cross-attention sublayer is the interaction module between the encoder and the decoder, we modify it to take into account the phrase selection probabilities z_i, as shown in Figure 2.
Here, we detail the underlying mechanism of each cross-attention head and how we modify it to encode phrase information. Recall that the input for a cross-attention layer includes a query state, a key state, and a value state. The query state Q_y is the (linear) projection of the output of the first sublayer (the masked multi-head attention layer). Intuitively, Q_y encapsulates the information about the previously generated tokens. The key state K_h = H W_k and the value state V_h = H W_v are two linear projections of the Encoder output H. W_k ∈ R^{d × d_k} and W_v ∈ R^{d × d_v} are the layer parameters, where d_k and d_v are the dimensions of the key and value states. The output of the cross-attention layer is then calculated as follows:

CrossAttention(Q_y, K_h, V_h) = Softmax(Q_y K_h^⊤ / √d_k) V_h

Here, we drop the superscripts for simplicity, but the notations should be clear from the context. Theoretically, one can inject the phrase information into either V_h or K_h. In practice, however, updating the value state introduces noise that counters the effect of pretraining Transformer-based models, which are commonly used for generation. As a result, we integrate the phrase probabilities into the key state, thus replacing K_h with a new key state K̃_h:

K̃_h = K_h + Z W_δ,    where Z_i = [1 − z_i, z_i]

where W_δ ∈ R^{2 × d_k} is the probabilistic fusion layer. Here, z_i is the ground-truth phrase label for phrase i during training (z_i ∈ {0, 1}), and the predicted probability of selecting the i-th phrase during inference (z_i ∈ [0, 1]). In the Q-model, we choose all tokens w_i in the full answer s as important tokens.
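A single fused cross-attention head can be sketched as follows, assuming each token carries a selection signal z_j copied from its phrase (labels in training, probabilities at inference). All weight matrices are random stand-ins.

```python
import numpy as np

# Sketch of the key-state fusion in one cross-attention head. Each token's
# selection signal z_j is embedded as [1 - z_j, z_j] and projected by
# W_delta into the key space; value states are left untouched, mirroring
# the choice to bias keys only.

def softmax_rows(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fused_cross_attention(Q_y, H, z, W_k, W_v, W_delta):
    K = H @ W_k                               # key states   (T_x, d_k)
    V = H @ W_v                               # value states (T_x, d_v)
    Z = np.stack([1.0 - z, z], axis=-1)       # (T_x, 2) soft phrase labels
    K_tilde = K + Z @ W_delta                 # bias keys with phrase info
    scores = Q_y @ K_tilde.T / np.sqrt(K.shape[-1])
    return softmax_rows(scores) @ V

rng = np.random.default_rng(0)
d, d_k, d_v, T_x, T_y = 8, 4, 4, 10, 3
H = rng.normal(size=(T_x, d))                 # encoder output
Q_y = rng.normal(size=(T_y, d_k))             # query states from the decoder
z = rng.uniform(size=T_x)                     # predicted phrase probabilities
out = fused_cross_attention(Q_y, H, z,
                            rng.normal(size=(d, d_k)),
                            rng.normal(size=(d, d_v)),
                            rng.normal(size=(2, d_k)))
```

Because the fusion is an additive shift of the keys, tokens from selected phrases receive larger attention logits without perturbing the pretrained value pathway.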
Training Given a training set of triples (x, z, y), where x is the input, y is the ground-truth output sequence and z indicates the labels for phrases that can be found in y, we can simultaneously train the phrase selection and the text generation models by optimizing the following loss:

L = L_gen(ŷ, y) + λ L_cls(ẑ, z)

where ẑ denotes the predicted labels for phrase selection, ŷ is the predicted output, and λ is a hyper-parameter.
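The joint objective can be sketched as a weighted sum of a generation loss and a phrase-classification loss. The specific loss forms below (binary cross-entropy for selection, token-level negative log-likelihood for generation) are standard-practice assumptions, not taken verbatim from the paper.

```python
import math

# Sketch of the joint objective L = L_gen + lambda * L_cls, assuming
# binary cross-entropy over phrase candidates and token-level negative
# log-likelihood over the generated sequence.

def joint_loss(z_true, z_prob, y_true, y_prob, lam=1.0):
    eps = 1e-9
    # phrase-selection loss: mean binary cross-entropy
    l_cls = -sum(z * math.log(p + eps) + (1 - z) * math.log(1 - p + eps)
                 for z, p in zip(z_true, z_prob)) / len(z_true)
    # generation loss: mean NLL of the gold token under the predicted dist
    l_gen = -sum(math.log(dist[tok] + eps)
                 for tok, dist in zip(y_true, y_prob)) / len(y_true)
    return l_gen + lam * l_cls
```

In the paper's experiments λ is set to 1, so the two tasks are weighted equally.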

Experimental Setup
Datasets We evaluate our method on two different QG tasks: a complex task on HotpotQA and a simpler task on SQuAD 1.1. There are two settings for HotpotQA (see Table 1): 1) HotpotQA (sup. facts), where the sentences that contain supporting facts for answers are known in advance; and 2) HotpotQA (full), where the context is longer and contains several paragraphs from different documents. For SQuAD 1.1, we use the split proposed in Zhou et al. (2017). Although our MultiFactor is expected to work best on HotpotQA (full), we also consider HotpotQA (sup. facts) and SQuAD 1.1 to investigate the benefits of multi-level content planning for shorter contexts.
Implementation Details We exploit two base models for MultiFactor: T5-base and MixQG-base. To train the FA-model, we apply QA2D (Demszky et al., 2018) to convert question-answer pairs into pseudo (gold) full answers. Both Q-model and FA-model are trained with λ = 1. Our code is implemented with Huggingface (Wolf et al., 2020), and AdamW (Loshchilov and Hutter, 2019) is used for optimization. More training details and data formats are provided in Appendix B.
Baselines The baselines (in Table 2) can be grouped into several categories: 1) early seq2seq methods that use GRU/LSTM and attention for the input representation, such as SemQG, NGQ++, and s2sa-at-mcp-gsa; 2) graph-based methods for content planning, like ADDQG, DP-Graph, IGND, CQG, MulQG, GATENLL+CT, and Graph2seq+RL; 3) methods based on pretrained language models, including T5-base, CQG, MixQG, and QA4QG. Among these baselines, MixQG and QA4QG are strong ones, with QA4QG being the state-of-the-art model on HotpotQA. Here, MixQG is a pretrained model tailored for the QG task, whereas QA4QG exploits a Question Answering (QA) model to enhance QG.

Main Results
The performance of MultiFactor and the baselines is shown in Table 2, with the following main insights.
On HotpotQA, it is observable that our method obtains superior results on nearly all evaluation metrics. Specifically, MultiFactor outperforms the current state-of-the-art model QA4QG by about 8 and 2.5 BLEU-4 points in the full and the supporting facts settings, respectively. Note that we achieve such results with a smaller number of model parameters compared to QA4QG-large. Specifically, the current state-of-the-art model exploits two BART-large models (for QA and QG) with a total of 800M parameters, whereas MultiFactor has around 440M parameters, corresponding to two T5/MixQG-base models. Here, the extra parameters associated with phrase selection in PET (T5/MixQG-base) amount to only 0.02M, which is relatively small compared to the number of parameters in T5/MixQG-base.
By cross-referencing the performance of common baselines (MixQG or QA4QG) on HotpotQA (full) and HotpotQA (supp. facts), it is evident that these baselines are more effective on HotpotQA (supp. facts). This is intuitive, since the provided supporting sentences can be regarded as sentence-level content planning that benefits models on HotpotQA (supp. facts). However, even without this advantage, MultiFactor on HotpotQA (full) outperforms these baselines on HotpotQA (supp. facts), showing the advantages of MultiFactor for long contexts.
On SQuAD, MultiFactor is better than most baselines on multiple evaluation metrics, demonstrating the benefits of multi-level content planning even for short contexts. However, the margin of improvement is not as significant as that seen on HotpotQA.
MultiFactor falls behind some baselines, such as IGND, in terms of ROUGE-L. This could be because generating questions on SQuAD requires information mainly from a single sentence. Therefore, a simple copy mechanism like the one used in IGND may lead to higher ROUGE-L.

Ablation Study
We study the impact of different components of MultiFactor and show the results with MixQG-base in Table 3, with more details in Appendix C. Here, "Fine-tuned" indicates the MixQG-base model finetuned for our QG tasks. For Cls+Gen, the phrase selection task and the generation task share the encoder and are jointly trained as in PET; the phrase information, however, is not integrated into the decoder for generation and serves only to enhance the encoder. One-hot PET-Q indicates that, instead of using soft labels (probabilities of a phrase being selected), we inject the predicted hard labels (0 or 1) into PET. Finally, PET-Q denotes MultiFactor without the full answer information.
Phrase-level Content Planning By comparing PET-Q, One-hot PET-Q and Cls+Gen to the finetuned MixQG-base in Table 3, we can draw several observations. First, adding the phrase selection task helps improve QG performance. Second, integrating phrase selection into the decoder (as in One-hot PET-Q and PET-Q) is more effective than merely exploiting phrase classification as an additional task (as in Cls+Gen). Finally, it is preferable to utilize soft labels (as in PET-Q) rather than hard labels (as in One-hot PET-Q) to bias the decoder.
Sentence-level Content Planning By comparing MultiFactor to the other variants in Table 3, it becomes apparent that using the full answer prediction helps improve QG performance in most cases. The contribution of the FA-model is particularly evident on HotpotQA (full), where the context is longer. In this instance, the FA-model provides an answer-aware summary of the context, which benefits downstream QG. In contrast, for SQuAD, where the context is shorter, the FA-model still helps but its impact appears to be less notable.

The Roles of Q-model and FA-model
We investigate two possible causes that may impact the effectiveness of MultiFactor: potential errors in converting full answers to questions in the Q-model, and error propagation from the FA-model to the Q-model. For the first cause, we evaluate Q-model (w/ Gold-FA), which takes as input the gold full answers rather than the FA-model outputs. For the second cause, we assess Q-model (w/o Context) and Q-model (w/ Oracle-FA). Here, Q-model (w/ Oracle-FA) is provided with the oracle full answer, which is the output with the highest BLEU among the top five outputs of the FA-model.

Human Evaluation
Automatic evaluation against a single gold question cannot account for the multiple valid variations that can be generated from the same input context/answer. As a result, three people were recruited to evaluate four models (T5-base, PET-Q, MultiFactor and its variant with Oracle-FA) on 200 random test samples from HotpotQA (supp. facts). The evaluators independently judged whether each generated question is correct or erroneous, and they were not aware of the identity of the models in advance. In the case of an error, evaluators were requested to choose between two types of errors: hop errors and semantic errors. Hop errors refer to questions that miss key information needed to reason about the answer, while semantic errors indicate questions that disclose the answer or are nonsensical. Additionally, we analyse the ratio of errors in two types of questions on HotpotQA: bridge, which requires multiple hops of information across documents, and comparison, which often starts with "which one" or whose answer is of yes/no type. Human evaluation results are shown in Table 5, and we also present some examples in Appendix E.
MultiFactor vs Others Comparing MultiFactor to the other models (T5, PET-Q) in Table 5, we observe an increase in the number of correct questions, showing that multi-level content planning is effective. The improvement of MultiFactor over PET-Q is more noticeable here than in Table 4 with automatic metrics. This partially validates the role of full answers even with short contexts. In such instances, full answers can be seen as an answer-aware paraphrase of the context that is more convenient for downstream QG. In addition, one can see a significant reduction of semantic errors in MultiFactor compared to PET-Q. This is because the model better understands how a short answer is positioned in a full answer context, so we can reduce the disclosure of (short) answers and the wrong choice of question types. However, there is still room for improvement, as MultiFactor (w/ Oracle-FA) is still much better than the variant with the greedy full answer from the FA-model (referred to as Multi in Table 5). In particular, there should be a significant reduction in hop errors if one can choose better outputs from the FA-model.
Error Analysis on Question Types It is observable that multi-level content planning plays an important role in reducing errors associated with "bridge"-type questions, which is intuitive given the nature of this type. However, we do not observe any significant improvement for the comparison type. Further examination reveals two possible reasons: 1) the number of questions of this type is comparatively limited; 2) QA2D performs poorly in reconstructing the full answers for this type. Further studies are expected to mitigate these issues.

Comparison with LLM-based QG
As Large Language Models (LLMs) perform outstandingly on various text generation tasks, we evaluate the performance of zero-shot GPT-3.5 (Brown et al., 2020) and LoRA fine-tuned Llama2-7B (Hu et al., 2022; Touvron et al., 2023). The results, together with those of T5-base and MixQG-base (finetuned), are given in Table 6, where several observations can be made. Firstly, MultiFactor outperforms the other methods on automatic scores by a large margin. Secondly, finetuning yields better automatic scores compared to zero-shot in-context learning with GPT-3.5-Turbo. Finally, Llama2-7B-LoRA is inferior to the methods based on finetuning moderate-sized models (T5-base/MixQG-base) across all of these metrics.
Human Evaluation As LLMs tend to use a wider variety of words, automatic scores based on one gold question do not precisely reflect the quality of these models. As a result, we conducted a human evaluation and show the results in Table 7. Since the OpenAI service may regard some prompts as invalid (i.e., not safe for work), the evaluation was conducted on 100 valid samples from the sample pool that we considered in Section 4.5. The human annotators were asked to compare a pair of methods on two dimensions: factual consistency and complexity. The first dimension ensures that the generated questions are correct, and the second prioritizes complicated questions, as that is the objective of multi-hop QG.
Additionally, Llama2-7B-LoRA outperforms GPT-3.5-Turbo (zero-shot), which is consistent with the automatic evaluation results in Table 6. Interestingly, T5-base (finetuning) outperforms Llama2-7B-LoRA in this evaluation as well. An in-depth analysis also reveals a common issue with GPT-3.5-Turbo (zero-shot): its output questions often reveal the given answers. Therefore, multi-level content planning in the instruction or demonstration for GPT-3.5-Turbo could be used to address this issue in LLM-based QG, potentially resulting in better performance.

Conclusion and Future Works
This paper presents MultiFactor, a novel QG method with multi-level content planning. Specifically, MultiFactor consists of an FA-model, which simultaneously selects important phrases and generates an answer-aware summary (a full answer), and a Q-model, which takes the generated full answer into account for question generation. Both FA-model and Q-model are formalized as our simple yet effective PET. Experiments on HotpotQA and SQuAD 1.1 demonstrate the effectiveness of our method.
Our in-depth analysis shows that there is considerable room for improvement along this line of work.
On one hand, we can improve the full answer generation model. On the other hand, we can enhance the Q-model in MultiFactor either by exploiting multiple generated full answers or by reducing the error propagation.

Limitations
Our work may have some limitations. First, the experiments are only on English corpora; the effectiveness of MultiFactor is not verified on datasets in other languages. Second, the context length in the sentence-level QG task is not very long, as shown in Table 8. Particularly long contexts (> 500 or 1000 tokens) need more exploration.

Ethics Statement
MultiFactor aims to improve the performance of the answer-aware QG task, especially complex QG. During our research, we did not collect any new datasets; instead, we conducted our experiments and constructed the corresponding full answers on previously published datasets. Our generation is completely within the scope of these datasets. Even if a result is incorrect, it is still controllable and harmless, posing no potential risk. The model currently supports English only, which limits its practical applications in the real world.

A Statistics of Datasets
Here, we list the lengths of the context, question and answer of the HotpotQA and SQuAD 1.1 datasets in Table 8. The HotpotQA supporting facts and full document settings share the same outputs and semi-gold full answers.

Training Details Because we train the model for a fixed number of epochs on HotpotQA and the dev set is too small (500 samples), we select the best result directly on the test set, following previous work (Pan et al., 2020; Su et al., 2022). On SQuAD 1.1, we select the result based on the dev set. The max length for HotpotQA-full is 512; for the other two settings it is 256. Moreover, the learning rate for MixQG-base is lower than that of the normal T5-base, as stated in (Murakhovs'ka et al., 2022). As a result, we employ learning rates of 5e-5 and 2e-5 for MixQG-base on HotpotQA and SQuAD 1.1, respectively, while those for T5-base are 1e-4 and 5e-5. All batch sizes are 32, except for HotpotQA-full, where the batch size is 16 and the number of training epochs is 5 instead of 10. We turn off sampling, and the beam sizes are 1 and 5 on HotpotQA and SQuAD 1.1, respectively. Other parameters take the default values in the Huggingface trainer and generation configuration files. More parameters and the time cost of training and inference are given in Table 9.
Data Format We list the input formats of the aforementioned experiments in Table 10. We use the special tokens <ans>, <passage>, and <fa> to mark the start of the answer, context, and full answer, respectively.
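The input formats can be illustrated as simple string templates. The field ordering below is an assumption for illustration; Table 10 gives the authoritative formats.

```python
# Illustrative input templates using the special tokens described above.
# The exact ordering of fields is an assumption; see Table 10 for the
# formats actually used in the experiments.

def fa_model_input(answer: str, passage: str) -> str:
    # FA-model input: short answer plus context
    return f"<ans> {answer} <passage> {passage}"

def q_model_input(answer: str, passage: str, full_answer: str) -> str:
    # Q-model input: short answer, context, and generated full answer
    return f"<ans> {answer} <passage> {passage} <fa> {full_answer}"
```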

C Ablation Study on T5
Considering that T5 is a more general text-to-text pretrained language model, we also conduct ablation studies on T5-base; the results are shown in Table 11.

D Ablation Study on Flan-T5
We conducted experiments initialized with Flan-T5-base to evaluate the performance of an instruction-finetuned model in the HotpotQA full document setting. Results are shown in Table 12, and the instruction is shown in Figure 3. Cross-comparing with the results in Tables 3 and 11, Flan-T5-base outperforms T5-base significantly but is still worse than MixQG-base. MixQG is a QG-specific pre-trained model, fine-tuned from T5-base on nine QA datasets with various answer types. These results are in line with our expectations.

E Error Examples
We list some error examples in Figure 4. For hop errors, we show three types: wrong hop, missing hop, and fabricated information. For semantic errors, we list a declarative generation instead of a question and a nonsensical case in which the output is longer than the input. Lastly, we present a comparison-type case where both the pseudo-gold and the generated full answer are wrong, although almost all comparison-type QA pairs have no pseudo-gold full answer.

Figure 1: An example from HotpotQA in which the question generated by MultiFactor requires reasoning over disjointed facts across documents.

Figure 2: Overview of our MultiFactor is shown on the left. Here, FA-model and Q-model share the same architecture of the Phrase-Enhanced Transformer, demonstrated on the right.
Table 5: Human evaluation results on HotpotQA (supp. facts), where Multi and Ocl-FA indicate MultiFactor (T5-base) and its variant where the Q-model is given the oracle full answer (w/ Oracle-FA). The last two lines show the error rates for questions of the bridge and comparison types.

Table 1: The statistics of HotpotQA and SQuAD 1.1, where Supp. and Full indicate the supporting facts setting and the full setting of HotpotQA.
Table 4 reveals several observations on HotpotQA (supp. facts) with MultiFactor (T5-base). Firstly, the high effectiveness of Q-model (w/ Gold-FA) indicates that the difficulty of QG largely lies in full answer generation. Nevertheless, we can still improve the Q-model further, e.g., by predicting the question type based on the grammatical role of the short answer in the FA-model outputs. Secondly, Q-model (w/o Context) outperforms PET-Q but not MultiFactor. This might be because the context provides useful information to mitigate the error propagation from the FA-model. Finally, the superiority of Q-model (w/ Oracle-FA) over MultiFactor shows that the greedy output of the FA-model is suboptimal, and thus being able to evaluate the top FA-model outputs can help improve overall effectiveness.

Table 8: The statistics of max/min/mean token length (NLTK tokenizer), the number of positive/negative phrases, and the number of valid/total full answer (FA) examples in the HotpotQA and SQuAD 1.1 datasets.

Details The MixQG pre-trained series of models are fine-tuned from T5, having the same architecture and number of parameters. In addition to the basic modules, MultiFactor adds a classifier (2d × 2) and L_d probability fusion layers (2 × d), where d and L_d denote the model dimension and the number of decoder layers. Specifically, when initializing with T5-base (220M, d = 768, L_d = 12), MultiFactor only increases the number of parameters by 1536 × 2 + 12 × 2 × 768 ≈ 0.02M (~0.01%).