Learning to Generate Questions by Learning to Recover Answer-containing Sentences

To train a question answering model based on machine reading comprehension (MRC), significant effort is required to prepare annotated training data composed of questions and their answers from contexts. Recent research has focused on synthetically generating a question from a given context and an annotated (or generated) answer by training an additional generative model to augment the training data. In light of this research direction, we propose a novel pre-training approach that learns to generate contextually rich questions by recovering answer-containing sentences. We evaluate our method against existing ones in terms of the quality of generated questions, and the fine-tuned MRC model accuracy after training on the data synthetically generated by our method. We consistently improve the question generation capability of existing models such as T5 and UniLM, achieve state-of-the-art results on MS MARCO and NewsQA, and obtain results comparable to the state-of-the-art on SQuAD. Additionally, the data synthetically generated by our approach is beneficial for boosting downstream MRC accuracy across a wide range of datasets, such as SQuAD-v1.1, v2.0, KorQuAD and BioASQ, without any modification to the existing MRC models. Furthermore, our method shines especially when a limited amount of pre-training or downstream MRC data is given.


Introduction
Machine reading comprehension (MRC), which finds the answer to a given question from its accompanying paragraphs (called context), is an essential task in natural language processing. With the release of high-quality human-annotated datasets for the task, such as SQuAD-v1.1 (Rajpurkar et al., 2016), -v2.0 (Rajpurkar et al., 2018), and KorQuAD (Lim et al., 2019), researchers have proposed MRC models that even surpass human scores. These datasets commonly involve finding a snippet within a context as the answer to a given question.

* These authors contributed equally.
However, these datasets require a significant amount of human effort to create questions and their relevant answers from given contexts. Often the size of the annotated data is relatively small compared to that of the data used in other self-supervised tasks such as language modeling, limiting model accuracy.
To overcome this issue, researchers have studied models for generating synthetic questions from a given context along with annotated (or generated) answers on large corpora such as Wikipedia. Golub et al. (2017) suggested a two-stage network for generating question-answer pairs, which first chooses answers conditioned on the paragraph and then generates a question conditioned on the chosen answer. Dong et al. (2019) showed that pre-training on unified language modeling over large corpora including Wikipedia improves question generation capability. Prior work also introduced a self-supervised pre-training technique for question generation via the next-sentence generation task.
However, self-supervised pre-training techniques such as language modeling or next-sentence generation are not specifically conditioned on the candidate answer and instead treat it like any other phrase, despite the candidate answer being a strong conditional restriction for the question generation task. Also, not all sentences in a paragraph may be relevant to the questions or answers, so generating them may not be an ideal pre-training task for question generation.
To address these issues, we propose a novel training method called Answer-containing Sentence Generation (ASGen) for a question generator. ASGen is composed of two steps: (1) predicting "answer-like" candidate answers in a given context and (2) pre-training the question generator on the answer-containing sentence generation task. We evaluate our method against existing ones in terms of the generated question quality as well as the fine-tuned MRC model accuracy after training on the data synthetically generated by our method.

[Figure 1: Architecture of a simple generative model, BertGen. When applying our training method "ASGen" to the model, the question generator takes as input the answer and the context with the answer-containing sentence removed, and generates the missing answer-containing sentence.]
Experimental results demonstrate that our approach consistently improves the question generation quality of existing models such as T5 (Raffel et al., 2020) and UniLM (Dong et al., 2019), and achieves state-of-the-art results on MS MARCO (Nguyen et al., 2016) and NewsQA (Trischler et al., 2017), as well as results comparable to the state-of-the-art on SQuAD. Additionally, we demonstrate that the data synthetically generated by our approach can boost downstream MRC accuracy across a wide range of datasets, such as SQuAD-v1.1, v2.0, KorQuAD and BioASQ (Tsatsaronis et al., 2015), without any modification to the existing MRC models. Furthermore, our experiments highlight that our method shines especially when a limited amount of training data is available, in terms of both pre-training and downstream MRC data.

Proposed Method
This section discusses our proposed training method called Answer-containing Sentence Generation (ASGen). While ASGen can be applied to any generative model, we use a simple Transformer (Vaswani et al., 2017) based generative model as our baseline, which we call BertGen. First, we describe how the BertGen model generates synthetic questions and answers from a context. Next, we explain the details of candidate answer prediction and how we pre-train the question generator in BertGen based on these candidates. BertGen encodes given paragraphs with two networks: the answer generator and the question generator.
Answer Generator. To make the contextual embeddings and to predict answer spans for a given context without the question, we utilize a BERT (Devlin et al., 2019) encoder (Fig. 1-(1), BERT Encoder-A). We select the top K candidate answer spans from the context by sorting them by the confidence score of the span prediction. We use the K selected answer spans as input to the question generator.
Question Generator. Next, we generate a question conditioned on each answer predicted by the answer generator. Specifically, we give as input to a BERT encoder the context and an indicator of the answer span location in the context (Fig. 1-(2), BERT Encoder-Q). A Transformer decoder then generates the question word by word based on the encoded representation of the context and the answer span. When pre-training the question generator on the answer-containing sentence generation task, we exclude the answer-containing sentence from the original context and train the model to generate the excluded sentence given the modified context and the answer span as input.
Finally, we generate synthetic questions and answers from a large corpus, e.g., all the paragraphs in Wikipedia. After generating this data, we train the MRC model on the generated data in the first phase and then fine-tune on the downstream MRC dataset (e.g., SQuAD) in the second phase. In this paper, we use BERT as the default MRC model, since BERT or its variants achieve state-of-the-art performance across numerous MRC tasks.

Candidate Answer Prediction
In question generation, it is important to determine which part of a given context can serve as a suitable answer for generating questions. To this end, we predict candidate answer spans in the given context W = {w_t}, t = 0, ..., T, to obtain a more appropriate set of "answer-like" phrases, where T is the number of word tokens in the context and the token at t = 0 is "[CLS]". To calculate the score s_i for start index i of a predicted answer span, we compute the dot product of the encoder output h_i with a trainable vector v_s:

s_i = v_s · h_i.

For each start index i, we calculate the span end index score e_{i,j} for end index j in a similar manner with a trainable vector v_e:

e_{i,j} = v_e · f_s(h_i ⊕ h_j),

where f_s represents a fully connected layer with hidden dimension H and ⊕ indicates the concatenation operation. For training, we use cross-entropy loss on s_i and e_{i,j} against the ground-truth start and end of the answer span. During inference, we choose the top K answer spans with the highest summation of start and end index scores, s_i + e_{i,j}. Each selected answer span A^span_k is then given to the question generator as input, in the form of an indication of the answer span location in the given context.
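The span scoring and top-K selection above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the names (h, v_s, v_e, f_s) follow the notation in the text, and the maximum span length is an assumption added to keep the search tractable.

```python
import numpy as np

def top_k_answer_spans(h, v_s, v_e, f_s, k=5, max_len=10):
    """Score candidate answer spans as described in the text.

    h:   (T, H) encoder outputs for the context tokens
    v_s: (H,)   trainable start-score vector
    v_e: (H,)   trainable end-score vector
    f_s: callable mapping a (2H,) concatenation to an (H,) vector
         (the fully connected layer f_s in the text)
    """
    T = h.shape[0]
    s = h @ v_s                      # start score s_i for every index i
    spans = []
    for i in range(T):
        for j in range(i, min(i + max_len, T)):
            # end score e_{i,j}: concatenate start/end states, project, dot with v_e
            e = v_e @ f_s(np.concatenate([h[i], h[j]]))
            spans.append((s[i] + e, i, j))  # rank by summed start + end score
    spans.sort(key=lambda x: -x[0])
    return spans[:k]                 # top K (score, start, end) candidates
```

Each returned (start, end) pair would then be marked in the context as the answer indicator passed to the question generator.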

Pre-training Question Generator
In order to generate questions conditioned on different answers that may arise in a context, we generate a question for each of the K answers. Prior work proposed a pre-training method for this generative model using the self-supervised task of generating the next sentence. We identify several issues with this approach. The technique is not specifically conditioned on the answer, despite the answer being a strong condition for the question generation task. Also, not all sentences in a paragraph may be relevant to the questions or answers from within that paragraph, so generating them is not an ideal pre-training task for a question generation model.
To address these issues, we modify the context to exclude the sentence containing the previously generated answer and pre-train the question generation model on the task of generating this excluded answer-containing sentence, conditioned on the answer and the modified context.
Specifically, we exclude the answer-containing sentence S_ans while retaining the answer, modifying the original context D to D_ans, i.e., D_ans is D with S_ans removed but the answer span kept in place. Note that for fine-tuning the question generator we do not exclude the answer-containing sentence, so that D_ans = D. In BertGen, we pass the previously generated answer to the generation model in the form of an additional position encoding M_ans that indicates the answer location within the context: we assign encoding id 0 to each word in the context and encoding id 1 to each word in the answer, and build M_ans by stacking the trainable vectors m_0 and m_1 corresponding to encoding ids 0 and 1, respectively (A * B indicates the operation of stacking vector A for B many times).
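The construction of D_ans and the answer position encoding ids can be sketched in plain Python. This is an illustrative reconstruction under the description above; the function names and the character-offset interface are our own, not the paper's.

```python
def make_asgen_inputs(sentences, ans_sent_idx, ans_span):
    """Build the ASGen pre-training input for one example.

    sentences:    list of sentences making up the context D
    ans_sent_idx: index of the answer-containing sentence S_ans
    ans_span:     (start, end) character offsets of the answer inside S_ans

    Returns the modified context D_ans (S_ans removed, answer kept),
    the target sentence to generate, and the answer text.
    """
    s_ans = sentences[ans_sent_idx]
    answer = s_ans[ans_span[0]:ans_span[1]]
    # Remove the answer-containing sentence but keep the answer itself,
    # so generation is still conditioned on the answer.
    d_ans = sentences[:ans_sent_idx] + [answer] + sentences[ans_sent_idx + 1:]
    return " ".join(d_ans), s_ans, answer

def answer_position_ids(context_tokens, answer_tokens):
    """Encoding ids behind M_ans: 1 inside the answer span, 0 elsewhere."""
    ids = [0] * len(context_tokens)
    n = len(answer_tokens)
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == answer_tokens:
            ids[start:start + n] = [1] * n
            break
    return ids
```

M_ans itself would then be obtained by looking up m_0 or m_1 for each id.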
Next, we generate the answer-containing sentence output word probabilities W^o = {w^o_y}, y = 1, ..., Y, from the encoded representation

C_enc = BERT Encoder-Q(D_ans, M_ans),

where C_enc is the encoded representation of the context. The decoder produces each output word distribution over the vocabulary using a word embedding matrix E ∈ R^{d×V} with vocabulary size V, shared between BERT Encoder-Q and the decoder. Note that w^o_0 is a zero vector for starting the decoding.
Finally, we calculate the loss of the generated words using the cross-entropy loss

L = -(1/Y) Σ_{y=1}^{Y} z_y · log w^o_y,

where z_y indicates the ground-truth one-hot vector of the y-th answer-containing sentence word. Note that z is the question word in the case of fine-tuning.
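The token-level cross-entropy above can be written as a small NumPy sketch. This is illustrative only: the ground-truth one-hot vectors z are represented here by token ids, and the logits stand in for the decoder's unnormalized output scores.

```python
import numpy as np

def generation_loss(logits, target_ids):
    """Mean cross-entropy over the generated sentence (or question).

    logits:     (Y, V) unnormalized decoder scores for Y output steps
    target_ids: (Y,)   ground-truth token ids (one-hot z in the text)
    """
    # log-softmax over the vocabulary, computed stably
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-probability of each ground-truth token, averaged over steps
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()
```

With uniform logits over a vocabulary of size V, the loss reduces to log V, a useful sanity check.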
In this manner, we pre-train the question generation model using a task similar to the final task of conditionally generating the question from a given answer and a context.

Experimental Setup
Pre-training Dataset. To build the dataset for the answer-containing sentence generation task (ASGen) and the synthetic MRC data for pre-training the downstream MRC models, we collect all paragraphs from the entire English Wikipedia dump and synthetically generate questions and answers on these paragraphs. Note that we removed all passages from Wikipedia overlapping with the SQuAD dataset (Rajpurkar et al., 2016). We apply filtering and clean-up steps that are detailed in the appendix.
Using BertGen, we extract answers from each given paragraph and then generate questions for each answer-paragraph pair. Finally, we obtain 43M question-answer-paragraph triples as synthetic data. For pre-training on answer-containing sentence generation, we sample 25M answer-paragraph pairs (Full-Wiki) from the final Wikipedia dataset, avoiding extremely short contexts of less than 500 characters. For ablation studies on pre-training approaches, we sample 2.5M pairs (Small-Wiki) from Full-Wiki and split off 25K pairs (Test-Wiki) to evaluate the pre-training method.

Benchmark Datasets. In most MRC datasets, a question and a context are represented as a sequence of words, and the answer span (indices of its start and end words) is annotated in the context based on the question. Among these datasets, we choose SQuAD as the primary benchmark dataset for question generation, since it is the most popular human-annotated MRC dataset. We refer to the data split we use as Split1; it has 77K/10K/10K samples for the train/dev/test sets. We also evaluate on the reversed dev-test split, referred to as Split2. Additionally, we test our question generation on MS MARCO (Nguyen et al., 2016) and NewsQA (Trischler et al., 2017) to evaluate the generalization of our method to other datasets. In the case of MS MARCO, questions are collected from real user query logs in Bing. For these datasets, we follow the pre-processing of Tuan et al. (2020), sampling a subset of the original data where the answers are sub-spans of their corresponding paragraphs, to obtain train/dev/test sets with 51K/6K/7K samples for MS MARCO and 76K/4K/4K samples for NewsQA. We also conduct question generation experiments on Natural Questions (Kwiatkowski et al., 2019) and BioASQ (Tsatsaronis et al., 2015). We calculate BLEU-4, METEOR, and ROUGE-L with the script from Du et al. (2017).
To evaluate the effectiveness of generated synthetic MRC data, we test the fine-tuned MRC model on the downstream MRC dataset after training on the generated synthetic data. We calculate the EM/F1 score of the MRC model on SQuAD-v1.1 and v2.0 development set. We also evaluate on the test set of KorQuAD, a Korean dataset created with the same procedure as SQuAD-v1.1.

Implementation Details.
For all experiments and models, we use all official original hyperparameters unless otherwise stated below. For the BertGen model, we use pre-trained BERT (Base and Large) as the encoder and 12 stacked Transformer layers as the decoder. For the large version of the model, we use 24 layers for the encoder and the decoder, with 737M parameters. For answer prediction, we select the top-5 (K = 5) answer spans. For the generation of unanswerable questions in SQuAD-v2.0, we separate unanswerable and answerable cases and train separate generation models. For all BertGen models, we pre-train the question generator for 5 epochs on Wikipedia and fine-tune it for 30 epochs on the MRC dataset with a batch size of 32. For other question generation models, we pre-train for 1 epoch on Wikipedia. For UniLM and T5, the input is formulated as sequence-to-sequence: the first input segment is the concatenation of the context and the answer, while the output segment is the missing answer-containing sentence or the question to be generated. We use all official settings for UniLM, ProphetNet (Qi et al., 2020) and ELECTRA (Clark et al., 2020), and use the official pre-trained weights. The training time depends on the data size and the model complexity. For Zhao et al. (2018), pre-training on Full-Wiki takes 48 hours. Pre-training BertGen on Small-Wiki in Table 3 takes 48 hours with 8 Tesla V100 GPUs.

Question Generation Performance. As shown in Table 2, 'BertGen (Large) + ASGen' outperforms all existing models on all scores on both MS MARCO and NewsQA, except for comparable METEOR scores on NewsQA. Our method also shows improvement on the Natural Questions (Kwiatkowski et al., 2019) (short answer) dataset, where questions are collected from real user query logs on Google.

Ablation Study of Pre-training Task. We also compare BLEU-4 scores between various pre-training tasks to show the effectiveness of ASGen. As shown in Table 3, ASGen outperforms NS in recovering the answer-containing sentence on Test-Wiki, e.g., 5.2 vs. 1.4 when pre-trained on Small-Wiki and 8.2 vs. 3.4 on Full-Wiki. ASGen also outperforms NS in question generation, e.g., 22.2 vs. 20.6 and 24.2 vs. 22.6 on the two splits, respectively. We further observe that conditioning on a given answer improves ASGen, e.g., 20.1 vs. 19.9 on Split1 and 21.4 vs. 21.0 on Split2.

Human Evaluation. As Sultan et al. (2020) note, accuracy-based measurements such as BLEU-4, METEOR and ROUGE-L may not be adequate to test the diversity of questions. Therefore, we also judge question quality by human evaluation, involving 10 evaluators rating syntax, semantic validity, question-to-context relevance and question-to-answer relevance on 50 randomly chosen samples from the SQuAD-v1.1 dev set. As shown in Table 4, applying ASGen consistently improves the human evaluation scores.

Downstream MRC Task Performance
To show the effectiveness of the generated synthetic data, we train MRC models on the generated data before fine-tuning on the downstream data. As shown in Table 5, the synthetic data generated by 'BertGen (Large) + ASGen' consistently improves the performance of BERT (Large, WWM) by a significant margin. Pre-training BERT on synthetic data improves F1 scores by 1.8 on SQuAD-v1.1 and 5.6 on SQuAD-v2.0 for BERT (Large), and by 0.7 on SQuAD-v1.1 and 2.5 on SQuAD-v2.0 for BERT (WWM). Synthetic data also improves ELECTRA performance on SQuAD-v2.0, and BERT+CLKT performance on KorQuAD.

Also, to show the improvement due to our pre-training method in the downstream MRC task, we compare the EM/F1 scores of BERT (Large) models trained on synthetic data generated by different question generation models: 'BertGen', 'BertGen + NS' and 'BertGen + ASGen'. As shown in Table 6, our method outperforms the other methods on both SQuAD-v1.1 and SQuAD-v2.0.

Fig. 2 shows the effects of varying amounts of downstream MRC data and synthetic data on the F1 scores of BERT (Large). In Fig. 2-(a), where we fix the size of the synthetic data at 43M, pre-training with 'BertGen + ASGen' consistently outperforms 'BertGen + NS' for all sizes of downstream data. While the performance difference is particularly apparent for smaller amounts of downstream data, it persists even when using the entire MRC dataset (SQuAD-v1.1). In Fig. 2-(b), we also train BERT (Large) using different amounts of generated synthetic data, while keeping the number of pre-training steps constant and using the full downstream MRC dataset. Increasing the amount of synthetic data used consistently improves the accuracy of the MRC model.

Transfer Learning to Limited Domain
We also conduct experiments on the BioASQ (Tsatsaronis et al., 2015) dataset to show the effectiveness of our model in limited-data domains with little annotated data. As shown in Table 7, ASGen improves the question generation scores by 6.0 BLEU-4, 7.8 METEOR and 6.9 ROUGE-L on BioASQ factoid-type 6b. Moreover, using 'Full-Wiki' data enhances the performance of BERT (Large) by a large margin and outperforms BioBERT (Lee et al., 2019a) by 0.95 Macro F1 (Yes/No) and 1.63 F1 (List). Note that BioBERT is specifically pre-trained on a medical corpus (PubMed), whereas we use a generic Wikipedia corpus ('Full-Wiki'), with our generation models fine-tuned on SQuAD.

Qualitative Analysis of Generation
Comparison of Sample Questions. We qualitatively compare the questions generated after pre-training BertGen with NS and with ASGen to demonstrate the effectiveness of our method. For the correct answer "49.6%", as shown in the first sample in Table 9, the word "Fresno", which is critical to making the question specific, is omitted by NS, while ASGen's question does not suffer from this issue. Note that the word "Fresno" occurs in the answer-containing sentence. This issue also occurs in the second sample, where NS uses the word "available" rather than relevant words from the answer-containing sentence, but ASGen uses many of these words, such as "most" and "popular", to generate contextually rich questions. Also, the question from NS is about "two" libraries, while the answer is about "three" libraries, showing the lack of sufficient conditioning on the answer. Similarly, the third example shows that ASGen generates more contextual questions than NS by including the exact subject "TARDIS" based on the corresponding answer. Based on these observations and the score improvements in Table 3, we conjecture that ASGen leads the question generation model to better condition on the answer and to generate more contextualized questions than NS.

Categorization of Reasoning Type. We manually categorized the reasoning type of 150 randomly sampled generated questions on Wikipedia for both answerable and unanswerable questions. The results in Table 8 and Table 10 show that questions generated using ASGen often require multi-hop or other non-trivial reasoning. We follow the same categorization as Rajpurkar et al. (2016, 2018).

Related Work
Pre-training Approaches. Pre-trained generative models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2020) utilize the Transformer (Vaswani et al., 2017) to learn different types of language models on large datasets, followed by fine-tuning on a downstream task.
These pre-training approaches tend to be generic, while our approach is a pre-training method focused on the specific task of question generation. Lee et al. (2019b) suggested a pre-training method for information retrieval called the Inverse Cloze Task. Unlike this method, our pre-training task for the question generator is strongly conditioned on the answer and focuses on generating the missing answer-containing sentence in the context, to learn representations more suitable to the question generation task.

Synthetic Data Generation. Subramanian et al. (2018) show that neural models generate better candidate answers from a given paragraph than using off-the-shelf tools or selecting named entities and noun phrases. Other work introduced a training method for the MRC model that combines synthetic data and human-annotated data. Similar to our method, Golub et al. (2017) proposed to generate questions conditioned on generated answers by separating answer generation from question generation. Unlike our work, they do not pre-train their question generator on answer-containing sentences. Dong et al. (2019) also show that utilizing synthetic data boosts the performance of MRC models. Inspired by these previous studies, we propose a newly designed pre-training technique that improves the capability of question generation models.

Conclusions
We propose a novel pre-training method called ASGen to learn to generate contextually rich questions better conditioned on the answers. Our approach improves the question generation ability of existing methods and achieves new state-of-the-art results on MS MARCO and NewsQA, and the synthetic data it produces increases downstream MRC accuracy across a wide range of datasets without any modification to the existing MRC models.

A Question Generation on more Datasets
We also evaluate the question generation model on another data split (Split3) from Zhao et al. (2018). Split3 is obtained by randomly dividing the original SQuAD-v1.1 development set into two equal halves, choosing one as the development set and the other as the test set, while retaining the SQuAD-v1.1 train set. As shown in Table 11, applying ASGen to the reproduced question generation model from Zhao et al. (2018) improves the BLEU-4, METEOR, and ROUGE-L scores on Split3 by 1.3, 0.9, and 1.3, respectively.

B Downstream MRC Performance with ELECTRA
To further study the effect of our synthetic training data, we apply it to the ELECTRA (Clark et al., 2020) MRC model. In Table 12, we report the mean EM/F1 score on the SQuAD-v2.0 development set over four runs, using the official ELECTRA source code (https://github.com/google-research/electra) and the pre-trained checkpoint. Pre-training ELECTRA on the synthetic data generated using ASGen improves EM by 0.8 and F1 by 1.1 on the downstream MRC dataset, SQuAD-v2.0.

C Additional Downstream MRC Task Performance
Additionally, to show the effectiveness of the generated synthetic data, we train MRC models on the generated data before fine-tuning on two further downstream datasets, Natural Questions and NewsQA. As shown in Table 13, the synthetic data generated by 'BertGen (Large) + ASGen' consistently improves the F1 score of the baseline BERT models.

D Transfer Learning to Other MRC Dataset (QUASAR-T)
To show that our generated data is useful for other MRC datasets, we fine-tune and test the MRC model on QUASAR-T (Dhingra et al., 2017), another large-scale MRC dataset, after training on the synthetic data generated from SQuAD-v1.1. In this experiment, we first fine-tune 'BertGen + ASGen' on SQuAD-v1.1 and, using the synthetic data generated by this model, train the BERT (Large) MRC model. Afterwards, we fine-tune BERT (Large) for the downstream MRC task using QUASAR-T data. QUASAR-T has two separate datasets, one with short snippets as context and the other with long paragraphs as context. As shown in Table 14, training with our synthetic data improves the F1 score on the test set by 2.2 and 1.7 for the two cases, respectively.

E Details of Wikipedia Preprocessing
To build the answer-containing sentence generation data and the synthetic MRC data for SQuAD (Rajpurkar et al., 2016), we collect all paragraphs from all articles of the entire English Wikipedia dump and generate questions and answers on these paragraphs. We apply extensive filtering and clean-up to only retain the highest-quality paragraphs from Wikipedia, as follows.
To filter out low-quality articles, we remove those with less than 200 cumulative page-views including all re-directions in a two-month period.
In order to calculate the number of page-views, we used the official Wikipedia page-view dumps. Of the 5.4M original Wikipedia articles, filtering by page-views leaves 2.8M articles. We also remove articles with less than 500 characters, as they are often low-quality stub articles, which removes an additional 16% of the articles. We remove all "meta" namespace pages, such as talk, disambiguation, user pages and portals, as they often contain irrelevant text or casual conversations between editors. In order to extract clean text from the wiki-markup format of Wikipedia articles, we remove extraneous entities from the markup, including tables of contents, headers, footers, links/URLs, image captions, IPA double parentheticals, category tables, math equations, unit conversions, HTML escape codes, section headings, double-brace templates such as info-boxes, image galleries, HTML tags, HTML comments, and all tables.
We then split the cleaned text into paragraphs and remove all paragraphs with fewer than 150 or more than 3,500 characters. Paragraphs with between 150 and 500 characters were sub-sampled such that they make up 16.5% of the final dataset, as originally done for the SQuAD dataset. Since the majority of the paragraphs in Wikipedia are rather short, out of the 60M paragraphs from the final 2.4M articles, our final Wikipedia dataset contains 8.3M paragraphs. Finally, we generate 43M answer-paragraph pairs from the final Wikipedia dataset with the answer generator of BertGen.
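The paragraph-level filtering and sub-sampling above can be sketched as follows. The length thresholds and the 16.5% target come from the text; the function names and the exact sub-sampling heuristic are our own simplification.

```python
import random

def keep_paragraph(text, min_chars=150, max_chars=3500):
    """Length filter applied to cleaned paragraphs, per the appendix."""
    return min_chars <= len(text) <= max_chars

def subsample_short(paragraphs, target_fraction=0.165, seed=0):
    """Sub-sample 150-500 character paragraphs so they form ~16.5% of the output."""
    rng = random.Random(seed)
    short = [p for p in paragraphs if len(p) < 500]
    long_ = [p for p in paragraphs if len(p) >= 500]
    # Keep just enough short paragraphs that they make up target_fraction
    # of the final dataset: n_short / (n_short + n_long) = target_fraction.
    n_short = int(target_fraction * len(long_) / (1 - target_fraction))
    n_short = min(n_short, len(short))
    return long_ + rng.sample(short, n_short)
```

Applying `keep_paragraph` first and `subsample_short` second reproduces the order of operations described in the appendix.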

F Central Tendency and Variation for Human Evaluation
Human evaluation involves 10 evaluators over metrics such as syntax (ST), validation of semantics (SM), question to context relevance (CR) and question to answer relevance (AR) on 50 randomly chosen samples on SQuAD-v1.1 development set. Each score is in the range 1 to 5. Central tendency and variation can be found in Table 15.

G Central Tendency and Variation for the Downstream Tasks
For the EM and F1 scores on the downstream SQuAD-v1.1 and v2.0 development sets in our main paper, we selected 5 model checkpoints at different numbers of training steps from the same pre-training run on the synthetic data. We then fine-tuned each of these models on the final downstream data three times each, chose the best-performing model on the development set, and reported its score. Central tendency and variation can be found in Table 16.

H Details of Generating Unanswerable Questions
The mechanism of generating questions may differ between answerable and unanswerable questions. For example, the model could exploit a mismatched phrase to make a question plausible but unanswerable. In order to reflect these characteristics, we train answerable and unanswerable models separately. We first take the BertGen model pre-trained on the ASGen task and then fine-tune it on no-answer question generation on SQuAD-v2.0. We run inference with this model on the entire Wikipedia corpus to produce negative examples of unanswerable synthetic data for pre-training MRC models on SQuAD-v2.0.