Generative Language Models for Paragraph-Level Question Generation

Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a standard QG setting. It includes general-purpose datasets such as SQuAD for English, datasets from ten domains and two styles, as well as datasets in eight different languages. Using QG-Bench as a reference, we perform an extensive analysis of the capabilities of language models for the task. First, we propose robust QG baselines based on fine-tuning generative language models. Then, we complement automatic evaluation based on standard metrics with an extensive manual evaluation, which in turn sheds light on the difficulty of evaluating QG models. Finally, we analyse both the domain adaptability of these models as well as the effectiveness of multilingual models in languages other than English. QG-Bench is released along with the fine-tuned models presented in the paper (https://github.com/asahi417/lm-question-generation), which are also available as a demo (https://autoqg.net/).


Introduction
Question generation (QG, Mitkov and Ha, 2003) is the task of generating a question given an input context consisting of a document, a paragraph or a sentence, and an answer where the question is anchored (see Figure 1). QG has been widely studied in natural language processing communities (Du et al., 2017; Zhou et al., 2017; Du and Cardie, 2018), and it has recently been exploited to train question answering (QA) models without human supervision (Lewis et al., 2019; Zhang and Bansal, 2019; Puri et al., 2020), or as a means of data augmentation (Shakeri et al., 2020; Bartolo et al., 2021). It has also been applied to develop educational systems (Heilman and Smith, 2010; Lindberg et al., 2013), information retrieval models (Pyatkin et al., 2021; Lewis et al., 2021), and for model interpretation (Perez et al., 2020; Lee et al., 2020).
Despite its success in downstream applications, the development of neural QG models has received less attention. For example, the choice of the base pre-trained model is arbitrary (without proper justification in most cases) as it is not straightforward to compare different models. As a consequence, while ERNIE-GEN (Xiao et al., 2021) and UniLMv2 (Bao et al., 2020) are the current SotA on the SQuAD QG benchmark (Du et al., 2017), T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) are used in many applications in practice (Paranjape et al., 2021; Bartolo et al., 2021; Lewis et al., 2021; Pyatkin et al., 2021).
A possible reason is inconsistent evaluation and comparison of QG models, due to the lack of appropriate evaluation protocols and benchmarks. For instance, evaluation of QG models relies on BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE L (Lin, 2004), with human-made questions as references. However, some of these metrics may have low correlation with human judgements, especially when it comes to answerability, since they tend not to take the associated answer into account (Nema and Khapra, 2018). Moreover, QG applications can use different contexts as input, such as sentence-level (Pyatkin et al., 2021; Lewis et al., 2019) vs paragraph-level (Zhang and Bansal, 2019; Puri et al., 2020), or answer-aware (Shakeri et al., 2020; Bartolo et al., 2021) vs answer-free (Lopez et al., 2020). These are generally used interchangeably in the literature.
To investigate how to tackle the issues previously raised, we introduce QG-Bench, a collection of standard QA datasets unified into a single benchmark, including domain-specific datasets and datasets in eight different languages ( § 3). We then use QG-Bench to fine-tune various generative language models (LMs) by formulating paragraph-level QG as a sequence-to-sequence generation task ( § 4), and measure their performance on in-domain and language-specific data ( § 5). Finally, we present a multi-faceted analysis of our QG models by varying their input context size ( § 6.1), conducting a manual evaluation ( § 6.2), and studying their abilities for domain adaptation ( § 6.3).

Related Work
Early work on QG was based on human-engineered templates (Mitkov and Ha, 2003; Rus et al., 2010) and well-designed pipelines (Heilman and Smith, 2010; Labutov et al., 2015), but soon neural approaches took over by generating a question from a text in an end-to-end manner (Du et al., 2017; Zhou et al., 2017; Du and Cardie, 2018). The quality of QG models was later improved by masked LM pre-training (Devlin et al., 2019; Liu et al., 2019), where the encoder of the QG model is fine-tuned from pre-trained LMs (Chan and Fan, 2019; Zhang and Bansal, 2019). Recently, sequence-to-sequence LM pre-training has made it possible to fully fine-tune QG models (both encoder and decoder), achieving SotA performance (Dong et al., 2019; Qi et al., 2020; Bao et al., 2020; Xiao et al., 2021). Following the latest research in the literature, we focus on sequence-to-sequence LM-based QG models.
QG can be applied to domain adaptation (Shakeri et al., 2020), knowledge-enhanced LM pre-training (Jia et al., 2021), adversarial/counterfactual data augmentation (Bartolo et al., 2021; Paranjape et al., 2021), and nearest neighbour QA systems (Lewis et al., 2021). Applications of QG go beyond QA, including semantic role labeling (Pyatkin et al., 2021), visual QA (Krishna et al., 2019), multi-hop question decomposition (Perez et al., 2020), and question rewriting (Lee et al., 2020). Moreover, QG can be applied to unsupervised QA, which consists of training a QA model without any supervision, relying on a QG model to generate questions (Lewis et al., 2019). Puri et al. (2020) showed that with a carefully-designed QG model, we can generate high-quality QA datasets on which a QA model can even outperform its supervised counterparts. Relatedly, Zhang and Bansal (2019) proposed QA-based evaluation, which connects the quality of a QG model to the accuracy of a QA model trained on the synthetic data generated by the QG model.
While QG models can be applied to this variety of tasks, the comparison across tasks is not always straightforward. For this reason, and given the relevance of QG in current research, in this paper we propose an intrinsic QG benchmark in which we can evaluate different aspects of a QG model in a simple manner, including, but not only, analysis of input types, domain adaptability and multilinguality. The most similar work to ours is the MTG benchmark (Chen et al., 2021), which contains multilingual test sets for four NLG tasks. While QG is part of this benchmark, there are a few major differences from our proposed QG-Bench: (i) we provide training/validation/test sets to allow model training in each language in addition to the evaluation; (ii) MTG's test set consists of parallel sentences across languages obtained by translation from English, while we leverage monolingual datasets; (iii) we include eight languages, while MTG has five; and (iv) QG-Bench includes datasets from different domains and styles.

QG-Bench: A Unified Question Generation Benchmark
In this section, we describe our process to construct QG-Bench, including data collection and unification ( § 3.1), and its statistics ( § 3.2).

Data Collection and Unification
We unified a collection of datasets designed to be used for QG model training and evaluation. All datasets are in the same format, where each entry contains four features: paragraph, sentence, question, and answer. As described in Figure 1, we treat the question as the output of a QG system, conditioned on an answer that is always a sub-string of a sentence from the paragraph. We leverage existing QA datasets by compiling them into this unified QG format. All datasets included in QG-Bench are described below.

SQuAD (English). We first consider SQuAD v1.1 (Rajpurkar et al., 2016), an extractive QA dataset based on Wikipedia which has been commonly used in QG since (Du et al., 2017; Zhou et al., 2017). As the original test set of SQuAD is not released, we use the same data split as Du et al. (2017).

Domain-specific Datasets (English). To assess models' domain adaptivity, we consider two domain-specific QA datasets: SQuADShifts (Miller et al., 2020) and SubjQA (Bjerva et al., 2020). SQuADShifts contains questions in the same style as SQuAD but from four additional domains (Amazon/Wikipedia/News/Reddit), while SubjQA contains subjective questions over customer reviews from six domains.

Multilingual Datasets. We initially considered cross-lingual QA datasets (Clark et al., 2020; Artetxe et al., 2020; Lewis et al., 2020b) to obtain multilingual QG datasets, but XQuAD (Artetxe et al., 2020) and MLQA (Lewis et al., 2020b) do not contain training sets, and TyDiQA (Clark et al., 2020) contains a very small training set. Instead, we focused on monolingual QA datasets in each language.

Data Statistics
Table 1 summarizes the statistics of each QG dataset after unification. It can be observed that SubjQA and SQuADShifts have ten to a hundred times less training data than SQuAD. Also, SubjQA's answers are twice as long as SQuAD's answers, which can be explained by how the questions differ in the way they are formed (i.e., SubjQA being more subjective in nature). Likewise, except for Spanish, the datasets for languages other than English contain less training data than the original SQuAD, with the amount varying depending on the language.
LMs for Question Generation

In this section, we formalize the QG task from a language modelling perspective ( § 4.1), including details on the fine-tuning process ( § 4.2) and the setup for our experiments with QG-Bench ( § 4.3).

Task Formulation
Given an input text x, the goal of QG is to generate a natural question q related to the information in the input. The task is formulated as conditional sequence generation, and the model is optimized to maximize the conditional log-likelihood P (q|x).
In practice, the log-likelihood is factorized into word or subword level predictions, similar to other sequence-to-sequence learning settings (Sutskever et al., 2014).
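Written out explicitly (our notation, with q_t denoting the t-th token of the question and q_{<t} the preceding tokens), the factorized training objective is:

```latex
\log P(q \mid x) \;=\; \sum_{t=1}^{|q|} \log P\left(q_t \mid q_{<t},\, x\right)
```

Fine-tuning maximizes this quantity over all (context, question) pairs in the training set.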

Language Model Fine-tuning
Fine-tuning sequence-to-sequence LMs on QG can be done in the same way as for machine translation or summarization, where models are trained to predict the output tokens given the input tokens (Dong et al., 2019; Qi et al., 2020; Bao et al., 2020; Xiao et al., 2021). We follow Chan and Fan (2019) by introducing a highlight token <hl> that marks an answer a within a context c. Instead of a paragraph, we can similarly use a sentence as the context while highlighting the answer (sentence-level QG), or highlight a sentence instead of an answer (answer-free QG). We investigate these model variations in our analysis ( § 6.1), but assume the answer-highlighted paragraph as the default input. Note that it is possible to train other types of LMs on QG, but masked LMs were not designed for natural language generation and require a specific decoding technique (Chan and Fan, 2019). Also, recurrent LMs have a poor ability for conditional generation on the answer due to their unidirectional architecture (Lopez et al., 2020). Since they are not as well suited for QG as sequence-to-sequence models, they are out of the scope of this paper.
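As a concrete sketch of the three input types, the <hl> token can be inserted into the context as follows (the highlight-token convention follows the description above; the function name and mode labels are our own illustration):

```python
HL = "<hl>"

def build_qg_input(paragraph: str, sentence: str, answer: str,
                   mode: str = "paragraph") -> str:
    """Build the QG model input by inserting <hl> highlight tokens.

    mode="paragraph":   answer highlighted inside the full paragraph (default).
    mode="sentence":    answer highlighted inside the answer sentence only.
    mode="answer_free": the whole sentence highlighted inside the paragraph.
    """
    if mode == "paragraph":
        return paragraph.replace(answer, f"{HL} {answer} {HL}", 1)
    if mode == "sentence":
        return sentence.replace(answer, f"{HL} {answer} {HL}", 1)
    if mode == "answer_free":
        return paragraph.replace(sentence, f"{HL} {sentence} {HL}", 1)
    raise ValueError(f"unknown mode: {mode}")
```

For T5-based models, a task prefix such as generate question: would additionally be prepended to the resulting string (see the Appendix).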

Experimental Setup
Comparison Models. As sequence-to-sequence LMs, we use T5 (Raffel et al., 2020) and BART (Lewis et al., 2020a) for the English datasets, and mT5 (Xue et al., 2021) and mBART (Liu et al., 2020) for the multilingual experiments. Model weights are taken from HuggingFace (Wolf et al., 2020). Previous research reported improvements on QG with more recent LMs (Qi et al., 2020; Xiao et al., 2021; Bao et al., 2020). We tried to replicate these previous works in QG-Bench, but after multiple attempts using their provided code and contacting the authors, this was not possible. Nonetheless, both T5 and BART are widely used in practice and, as we will show, they can still provide strong results with an appropriate configuration.

Parameter Optimization. We performed an extensive exploration to find the best combination of hyper-parameters to fine-tune LMs on QG, consisting of a two-phase search. First, we fine-tune a model on every possible configuration from the search space for 2 epochs. The top-5 models in terms of BLEU4 (Papineni et al., 2002) on the validation set are selected to continue fine-tuning until their performance plateaus. Finally, the model that achieves the highest BLEU4 on the validation set is employed as the final model. We used BLEU4 as the objective metric in our parameter optimization since it is light to compute, following previous work (Du and Cardie, 2018; Dong et al., 2019; Xiao et al., 2021). However, as we will see in our experiments, future work could also explore alternative metrics for validation. The search space contains 24 configurations, made up of learning rates from [0.0001, 0.00005, 0.00001], label smoothing from [0.0, 0.15], and batch sizes from [64, 128, 256, 512]. Our experiments show that this simple parameter optimization strategy significantly improves all models' performances by robustly finding the best configuration for each one.
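The first phase of the search simply enumerates the Cartesian product of the three hyper-parameter lists; a sketch of the 24-configuration grid (the dictionary keys are illustrative names, not the exact ones used in the training code):

```python
from itertools import product

learning_rates = [1e-4, 5e-5, 1e-5]
label_smoothing = [0.0, 0.15]
batch_sizes = [64, 128, 256, 512]

# Phase 1: each configuration is fine-tuned for 2 epochs; the top-5
# by validation BLEU4 then continue until performance plateaus.
configs = [
    {"lr": lr, "label_smoothing": ls, "batch_size": bs}
    for lr, ls, bs in product(learning_rates, label_smoothing, batch_sizes)
]
```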
We ran the parameter optimization on a machine equipped with two Nvidia Quadro RTX 8000 GPUs. Taking SQuAD as a reference, training and evaluation took around three weeks for T5 LARGE , one week for T5 BASE and mT5 SMALL , three days for T5 SMALL , one week for BART LARGE , and four days for BART SMALL .

Automatic Evaluation
In this section, we report the main results on QG-Bench ( § 3), using the methodology described in § 4.

Evaluation Metrics
To evaluate QG models, BLEU4 (B4, Papineni et al., 2002), METEOR (MTR, Denkowski and Lavie, 2014), and ROUGE L (R-L, Lin, 2004) are commonly used to compare the generated outputs against reference questions at sentence level. We also compute BERTScore (BS, Zhang et al., 2019) and MoverScore (MS, Zhao et al., 2019). Both leverage BERT-like models in their computation, achieving higher correlations with human judgements than other traditional metrics in various NLG tasks (Zhang et al., 2019; Zhao et al., 2019). To the best of our knowledge, they have not been applied in QG evaluation before, despite their success in NLG. We use the default configuration for both metrics, which makes use of RoBERTa LARGE (Liu et al., 2019) for BERTScore and DistilBERT BASE (Sanh et al., 2019) for MoverScore.
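For intuition about the n-gram metrics, a self-contained sketch of sentence-level BLEU4 follows. This is a simplified illustration with add-one smoothing; official implementations (e.g. sacreBLEU, or the scripts used for the reported numbers) differ in tokenization and smoothing details:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate: str, reference: str) -> float:
    """Sentence-level BLEU4: geometric mean of clipped 1-4 gram
    precisions (add-one smoothed) times the brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        # add-one smoothing avoids log(0) when higher-order matches are absent
        log_prec += math.log((clipped + 1) / (total + 1)) / 4
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```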

Results
SQuAD. Table 2 shows our results on the SQuAD test set along with other reported results from the literature. T5 LARGE provides the best results overall according to all automatic metrics. Even parameter-efficient models such as T5 BASE outperform ERNIE-GEN (Xiao et al., 2021), and T5 SMALL performs competitively with UniLMv2 (Bao et al., 2020) with nearly half the parameters. UniLMv2, in particular, was proposed as a highly effective model in spite of its light weight.
According to these results, T5 SMALL is also competitive on the QG task while being significantly lighter than other models.While T5 attains the best overall results, BART also proves competitive.
In fact, BART BASE is slightly better than T5 BASE .

Domain-specific Datasets. For SQuADShifts and SubjQA, since each domain contains only a small training set, we initialize the models' weights with a SQuAD fine-tuned model and continue fine-tuning on the domain-specific training set (more details on different strategies in § 6.3). As expected, given the subjective nature of the dataset, results on SubjQA are generally low for most metrics, except for BERTScore, whose score is even higher than on SQuAD in some cases. This implies that a model's prediction may have little word overlap with the reference question while remaining semantically close to it.

Analysis
In this section, we complement the automatic evaluation with an extensive analysis on various relevant aspects of the question generation models.

Model Input
In our main experiments, the model input is the paragraph in which the answer is highlighted, as described in § 4.2. Here we explore variations of the QG model's input type to understand the effect of different types of context. Concretely, we consider two additional variants: sentence-level models, which only take as input the sentence that contains the answer (instead of the whole paragraph); and answer-free models, which highlight the sentence in the paragraph instead of the answer. Figure 2 provides a summary of the three input types analysed.
In Table 5 we report automatic metrics for answer-free and sentence-level QG models on SQuAD. In general, paragraph-level models, which use the most complete input, attain the best overall results. For example, answer-free T5 LARGE performs worse than paragraph-level T5 SMALL on all metrics except METEOR, which indicates the importance of the answer for question generation. Nonetheless, not having the answer as input still yields competitive results, which may appear surprising given the incomplete input. When comparing sentence-level and paragraph-level models, the difference is smaller, but paragraph-level models consistently outperform their sentence-level counterparts, even when smaller models are used. This implies that models actually utilize the global context provided by the full paragraph when it is available, rather than only the more local information within the sentence.

Manual Evaluation
Given the limitations of automatic metrics in text generation research (Reiter, 2018; Bhandari et al., 2020; Alva-Manchego et al., 2021), we also conducted a manual evaluation using Amazon Mechanical Turk, focusing on three criteria: grammaticality (i.e., grammatical correctness), understandability (i.e., whether the question is easy for readers to understand) and answerability (i.e., whether the question can be answered by the given input answer). Understandability may correlate with grammaticality, but a question without any grammatical mistakes can still have low understandability due to an overly complex structure; likewise, a question can be understandable even with a few grammatical mistakes (annotation guidelines are included in the Appendix). We randomly sampled 500 unique paragraphs from the SQuAD test set and selected a single answer in each paragraph. For each of the 500 paragraph-answer pairs, we generated questions from six QG models, and asked human annotators to score them for the three criteria on a 3-point scale. Each question was evaluated by five judges, thus collecting a total of 15,000 human judgments. As quality control, we asked workers to be native English speakers, and instructed them to complete a qualification test first; only those who passed the test worked on our annotation task. The time given for each assignment (with each assignment containing ten instances to annotate) was 30 minutes, and the reward was $2 per assignment (the full price of the annotation exercise was about $3,000). We attach a screenshot of the annotation interface in the Appendix.

Table 6: Manual evaluation results along with the automatic metrics. Each score is averaged over the 500 questions in the evaluation, where the best result in each metric is in bold face.

Comparison Models. For the manual evaluation, the target QG models include paragraph-level QG models based on T5 LARGE , T5 SMALL and BART LARGE ; T5 LARGE sentence-level and answer-free QG models; and NQG (Du et al., 2017), which is based on an LSTM architecture. NQG is included for completeness and to better analyse the effect of pre-trained LMs in general. T5 LARGE is our best model according to automatic metrics, so we compare it against different input types (answer-free and sentence-level), a different size (T5 SMALL ), and a different model architecture (BART LARGE ).

Inter-annotator Agreement. Since there are five unique annotators per generated question, we calculated Fleiss' kappa to measure the inter-annotator agreement.
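For reference, Fleiss' kappa over a set of items (each rated by the same number of raters) can be computed as follows; this is a generic sketch in pure Python, not the exact script used for the paper's numbers:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, where each item is a list of
    category labels (one per rater; all items have the same rater count)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # item x category count matrix
    counts = [[item.count(c) for c in categories] for item in ratings]
    # mean observed agreement per item
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # expected agreement from the marginal category proportions
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(len(categories))
    )
    return (p_bar - p_e) / (1 - p_e)
```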
We obtained 0.30 and 0.36 for grammaticality and understandability respectively, a fair agreement (Landis and Koch, 1977). The kappa is 0.61 for answerability, a substantial agreement.

Model-wise Evaluation. We report the results of our manual evaluation in Table 6, where each score is averaged over the 500 questions used in the study. Answerability is the most affected by model size, context type and model architecture, compared to the other criteria, except for NQG, which is the only approach not based on LM pre-training. In fact, when we compare T5 LARGE 's paragraph-level against sentence-level model, answerability decreases unlike the other two criteria, highlighting the importance of including all relevant available context so that the model can generate a suitable question. On the other hand, while answer-free models are worse than sentence-level models according to automatic metrics, the manual evaluation does not reflect a significant difference between them. In general, we can see how T5 LARGE , which is the best model overall according to the automatic metrics, is also the most robust model overall according to the manual evaluation, which reinforces the conclusions from the automatic evaluation.

Correlation Analysis. Leveraging the large set of collected human judgments, we investigate the correlation between human annotations and the automatic metrics considered in the automatic evaluation ( § 5.2). For this analysis, we included all the generated questions from all the models considered in the manual evaluation, i.e., 3,000 generated questions from six diverse models, each receiving five annotations. We took the average across the five annotators for each generated question to compute the correlation. Figure 3 shows the Spearman's rank correlation coefficient between the automatic metrics and the criteria collected through our manual evaluation. The p-values of all correlations are less than 0.05, so they are all statistically significant. To check the significance of the differences in correlation across metrics, we ran a Williams test, showing that the differences are statistically significant in all cases. According to the correlation analysis, no metric achieved a high agreement with human judgements in all criteria. This means that we should not rely on a single metric to capture all quality aspects of a model's output. We can conclude, however, that METEOR and MoverScore are well aligned with human judgements on answerability, while BERTScore appears to be better suited for grammaticality and understandability. Most importantly, BLEU4 and ROUGE L , which have mostly been used as default metrics in the QG literature, are not as reliable as the other metrics in any criteria.
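Spearman's rho, as reported in Figure 3, is the Pearson correlation of the rank vectors; a pure-Python sketch with average ranks for ties (helper names are our own):

```python
def rankdata(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho as Pearson correlation of the rank vectors."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```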

Domain Adaptation
In our main experiments on the domain-specific datasets of QG-Bench ( § 5.2), models were initialized with the SQuAD fine-tuned model due to the limited training set in each domain. To further explore the domain adaptability of QG models, we compared three different setups: (1) fine-tuning on the in-domain training set without SQuAD initialization, (2) zero-shot transfer from the SQuAD fine-tuned model, and (3) fine-tuning with a prior SQuAD initialization. Figure 4 shows the results of T5 LARGE (the best model in most domains in Table 4 and in the manual evaluation) in each domain for those three settings. For this analysis, we focus on the METEOR metric, which attains the highest correlation with human judges on answerability. We can confirm that the best setup is to initialize the model on SQuAD and then further fine-tune it on the domain-specific training sets. For SQuADShifts, however, this improvement is less marked in general, suggesting that T5 can handle inputs from different domains to a certain extent. In contrast, the zero-shot setting with SQuAD fine-tuning on SubjQA achieves very poor results overall. This is to a certain extent expected, since the questions in SubjQA are of a very different style.
Finally, while in this section we focused on domain adaptability for English, in the Appendix we also show zero-shot cross-lingual transfer results, adapting English-trained models to other languages. Similarly to previous work (Chen et al., 2021), the main conclusion is that there is still significant room for improvement in zero-shot cross-lingual transfer for QG.

Conclusion
In this paper we presented QG-Bench, a unified benchmark and evaluation for testing paragraph-level QG models. The benchmark is composed of the general-purpose SQuAD dataset, as well as domain-specific datasets of different styles for English. Moreover, it includes language-specific datasets for eight different languages. Using QG-Bench as a reference, we tested recent generative language models on the task, and evaluated them across a range of automatic metrics. To complement the automatic evaluation, we performed a comprehensive manual evaluation to better understand the performance of models and the role of automatic metrics (e.g., our study shows there are better metrics than the popular BLEU4 when it comes to QG). In general, our results show that LMs have come a long way for QG, being very competitive (e.g., T5 attains overall manual scores of 2.80, 2.93 and 2.95 in answerability, grammaticality and understandability, respectively, on SQuAD), but have room for improvement when dealing with different domains and styles, and especially on languages other than English.
As future work, we will continue to study QG evaluation metrics in depth to better understand what aspects we miss when we use specific automatic metrics, using our manual evaluation as a proxy. Moreover, the QG models analysed in this paper require an answer to be specified beforehand to generate the question. As a way to relax this constraint, we can train models for question and answer pair generation (QAG), generating the answer together with the question given a context. By generating both answers and questions together, new evaluation metrics would also be required to understand the validity and diversity of the answers selected, which we leave for future work.

Limitations
In this paper, we explored paragraph-level QG models, whose input is limited to around 500 tokens; the same methodology cannot be easily applied to longer documents. In our multilingual QG modeling, we considered datasets in seven languages other than English, but all of them are medium- to high-resource languages, so our experimental results cannot be generalized to a truly low-resource language setting. Finally, although the focus of our paper is mostly on SQuAD-style one-hop extractive QA, QG is also studied in more complex scenarios such as multi-hop QG with graph neural networks (Pan et al., 2020) and QG for very long answers (Cao and Wang, 2021). Moreover, QG models are used to attain better interpretability in question answering, as in multi-hop question decomposition (Perez et al., 2020) and question rewriting (Lee et al., 2020). As future work, we will expand our analysis to more complex scenarios and explore the connection with the QA task.

Ethics Statement
Regarding potential risks of using our QG models, it has been reported that language models inherit undesirable biases and generate toxic language (Schick et al., 2021), and one could find such text in the generated questions. However, we internally checked the generated questions used for the manual evaluation, and confirmed that they did not contain toxic content.

A Fine-tuning Details

Table 7 shows the best configuration to fine-tune each model, obtained through the parameter optimization process. To fine-tune the T5 models, we use the task prefix generate question: at the beginning of the input text.
Table 8 shows the decrease in each metric on SQuAD if the model is fine-tuned without parameter optimization. We observe considerable drops in performance: T5 SMALL and BART LARGE lose around 2 points in BLEU4 and ROUGE L . According to these results, we infer that T5 and BART previously appeared worse than more recent LMs (ProphetNet, UniLM, or ERNIE-GEN) on QG because they were under-fitted to the task due to sub-optimal fine-tuning parameters, rather than because they are inferior at learning the QG task.

B Manual Evaluation B.1 Sample Outputs
Table 9 presents a few examples of our model predictions with the scores made by the annotators, where the samples are chosen from the high-answerability and low-answerability groups of T5 LARGE .

B.3 Williams test
In § 6.2, we ran a correlation analysis; here we report the result of the Williams test checking the significance of the differences in correlation across metrics in Figure 7, showing that the differences are statistically significant as well.

B.4 Guidelines
Figure 8 shows an example of the user interface we implemented for our manual evaluation; the guidelines presented to the annotators are attached at the end of the paper.

C Unsupervised QA-based Evaluation
As a proxy for answerability, we run an unsupervised QA-based evaluation (Zhang and Bansal, 2019), which trains a QA model on synthetic data generated by the target QG model and evaluates the QA model on a human-annotated test set. As an alternative to the traditional metrics in QG, the Q-metric (Nema and Khapra, 2018) shows high agreement in terms of answerability, but we prefer QA-based evaluation (Zhang and Bansal, 2019), since it is more closely tied to downstream applications, while the Q-metric relies on heuristics such as the number of named entities and pre-defined question types. This evaluation measures the QG model's capability to generate high-quality questions: higher accuracy of the QA model indicates a better QG model. The synthetic data is usually generated over the paragraph and answer (PA) pairs collected by Du and Cardie (2018). Zhang and Bansal (2019) used a small subset of the PA pairs, since they contain 12 times more instances than the SQuAD training set. Since this introduces an artifact of the subset choice, we decided to train QA models on the entire set of PA pairs with the generated questions. Also, we train QA models solely on the synthetic data, which differs from work in semi-supervised QA where the QA model is trained on a concatenation of the synthetic data and the original SQuAD training set (Lee et al., 2020).
The synthetic QA data is created by generating a question for each of the one million PA pairs (Du and Cardie, 2018) with the target QG model. We then fine-tune BERT (Devlin et al., 2019; we use bert-base-cased from HuggingFace) on the synthetic QA data with the default configuration used in HuggingFace's tutorial for fine-tuning BERT on QA (https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering). We report F1 score and exact match on the SQuAD validation set, following Zhang and Bansal (2019), and will release the synthetic data on HuggingFace Datasets (https://huggingface.co/datasets). The results of our unsupervised QA-based evaluation in Table 10 indicate that the QA model accuracy correlates with the size of the QG model that generated the synthetic data: T5 LARGE yields the best QA model in both F1 and exact match, which is as good as supervised QA models not based on language models (Wang and Jiang, 2016; Yang et al., 2017). Also, smaller models such as T5 SMALL and BART BASE produce QA models with only a small decrease in performance, which shows the efficiency of our models, similarly to our results with the automatic metrics.

We fine-tune a multilingual language model on each multilingual QG dataset in § 5.2, and here we explore zero-shot multilingual transfer by evaluating the English fine-tuned QG model on other languages. Table 11 shows the zero-shot transfer results, where we fine-tune mT5 SMALL on SQuAD and evaluate it on the test sets of the multilingual QG datasets. Compared with Table 3, performance largely decreases, indicating the difficulty of zero-shot multilingual transfer in QG.
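The exact match and token-level F1 metrics used in the QA-based evaluation can be sketched in pure Python; this is a simplified version of the normalization in the official SQuAD evaluation script (lowercasing, stripping punctuation and articles, collapsing whitespace):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In practice both scores are averaged over all questions, taking the maximum over the available gold answers per question.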

Figure 3 :
Figure 3: Spearman's rank correlation over all the generated questions within the manual evaluation.

Figure 4 :
Figure 4: Comparison of METEOR (MTR) scores for T5 LARGE across in-domain fine-tuning, zero-shot transfer of SQuAD fine-tuned model, and in-domain finetuning from SQuAD model.

Figure 6 :
Figure 6: Spearman's rank correlation within automatic evaluation metrics among the 500 samples we used in SQuAD manual annotation.

Figure 5 and
Figure 5 and Figure 6 show the Spearman's rank correlations across automatic metrics and manual evaluation criteria among the questions we generated over the SQuAD test set for the manual annotation. The p-values of all those correlations are less than 0.05, so they are statistically significant.

Figure 7 :
Figure 7: Williams test on the difference in the correlation reported in Figure 3.The difference of correlation is significant if the value is less than 0.005.

Figure 9
Figure 9 shows the comparison of zero-shot QG transfer on the SQuADShifts and SubjQA datasets with T5 LARGE .

Figure 8 :
Figure 8: An example of the interface used in our manual evaluation.

Figure 9 :
Figure 9: Metric comparison for T5 LARGE across indomain fine-tuning, zero-shot transfer of SQuAD finetuned model, and in-domain fine-tuning from SQuAD model.

Table 2 :
QG model fine-tuning results on the test set of SQuAD, where the best result in each metric is in bold face. The results in the top row group are existing SotA models taken from their original papers, while the bottom row group contains our models.

Table 3 :
QG model fine-tuning results on the test set of all language-specific QG-Bench datasets, where the best result in each language is in bold face.

As we see in § 3.2, some datasets such as German and French have a limited amount of training instances, resulting in underfitted models for those languages. In general, the low scores on non-English datasets can be attributed to the under-sized training sets.

Table 4 :
QG model fine-tuning results on the test set of SQuADShifts and SubjQA where the best result in each metric is in bold face.
Domain-specific Datasets. Table 4 shows the results for all domain-specific datasets included in QG-Bench: SQuADShifts and SubjQA. Since each domain contains a small training set, our main strategy to obtain domain-specific QG models is to initialize their weights with a SQuAD fine-tuned model and continue fine-tuning on the in-domain data ( § 5.2).

Table 5 :
QG model fine-tuning results on the test set of SQuAD for answer-free and sentence/paragraph-level QG models.The best overall result for each metric is in boldface.

Table 7 :
The best parameter to fine-tune each model on SQuAD we found through the parameter optimization.

Table 9 :
Examples of the system outputs along with their scores from the manual evaluation.The sentence and answer are highlighted by boldface and underline in the paragraph.

Table 10 :
Unsupervised QA-based evaluation results of our answer-aware QG models (paragraph-level).All results are the performance on the validation set of original SQuAD by the model trained on the synthetic data generated by each QG model.
D.1 Zero-shot Multilingual Transfer

Table 11 :
Zero-shot results of mT5 fine-tuned on SQuAD, except for the first row, which shows the SQuAD fine-tuning result.