Improving Unsupervised Question Answering via Summarization-Informed Question Generation

Question Generation (QG) is the task of generating a plausible question for a given <passage, answer> pair. Template-based QG uses linguistically-informed heuristics to transform declarative sentences into interrogatives, whereas supervised QG uses existing Question Answering (QA) datasets to train a system to generate a question given a passage and an answer. A disadvantage of the heuristic approach is that the generated questions are heavily tied to their declarative counterparts. A disadvantage of the supervised approach is that it is heavily tied to the domain/language of the QA dataset used as training data. In order to overcome these shortcomings, we propose an unsupervised QG method which uses questions generated heuristically from summaries as a source of training data for a QG system. We make use of freely available news summary data, transforming declarative summary sentences into appropriate questions using heuristics informed by dependency parsing, named entity recognition and semantic role labeling. The resulting questions are then combined with the original news articles to train an end-to-end neural QG model. We extrinsically evaluate our approach using unsupervised QA: our QG model is used to generate synthetic QA pairs for training a QA model. Experimental results show that, trained with only 20k English Wikipedia-based synthetic QA pairs, the QA model substantially outperforms previous unsupervised models on three in-domain datasets (SQuAD1.1, Natural Questions, TriviaQA) and three out-of-domain datasets (NewsQA, BioASQ, DuoRC), demonstrating the transferability of the approach.

Introduction
The aim of Question Generation (QG) is the production of meaningful questions given a set of input passages and corresponding answers, a task with many applications including dialogue systems as well as education (Graesser et al., 2005). Additionally, QG can be applied to Question Answering (QA) for the purpose of data augmentation (Puri et al., 2020), where labeled <passage, answer, question> triples are combined with synthetic <passage, answer, question> triples produced by a QG system to train a QA system, and unsupervised QA (Lewis et al., 2019), in which only the QG system output is used to train the QA system.

Figure 1: Example questions generated via heuristics informed by semantic role labeling of summary sentences using different candidate answer spans.
Early work on QG focused on template or rule-based approaches, employing syntactic knowledge to manipulate constituents in declarative sentences to form interrogatives Smith, 2009, 2010). Although template-based methods are capable of generating linguistically correct questions, the resulting questions often lack variety and incur high lexical overlap with corresponding declarative sentences. For example, the question generated from the sentence "Stephen Hawking announced the party in the morning", with "Stephen Hawking" as the candidate answer span, could be "Who announced the party in the morning?", which has a high level of lexical overlap with the declarative sentence. This is undesirable in a QA system (Hong et al., 2020) since the strong lexical clues in the question would make it a poor test of real comprehension.
Neural seq2seq models (Sutskever et al., 2014) have come to dominate QG (Du et al., 2017), and are commonly trained with <passage, answer, question> triples taken from human-created QA datasets (Dzendzik et al., 2021), which limits applications to the domain and language of those datasets. Furthermore, the process of constructing such datasets involves a significant investment of time and resources.
We therefore propose a new unsupervised approach that frames QG as a summarization-questioning process. Employing freely available summary data, we first apply dependency parsing, named entity recognition and semantic role labeling to summaries, and then apply a set of heuristics that generate questions from the parsed summaries. An end-to-end neural generation system is then trained with the original news articles as input and the heuristically generated questions as target output.
An example is shown in Figure 1. The summary is used as a bridge between the questions and passages. Because the questions are generated from the summaries and not from the original passages, they have less of a lexical overlap with the passages. Crucially, however, they remain semantically close to the passages since the summaries by definition contain the most important information contained in the passages. A second advantage of this QG approach is that it does not rely on the existence of a QA dataset, and it is arguably easier to obtain summary data in a given language than equivalent QA data since summary data is created for many purposes (e.g. news, review and thesis summaries) whereas many QA datasets are created specifically for training a QA system.
In order to explore the effectiveness of our method, we carry out extensive experiments. We provide an extrinsic evaluation, and train an English QG model using news summary data. We employ our QG model to generate synthetic QA data to train a QA model in an unsupervised setting and test the approach with six English QA datasets: SQuAD1.1, Natural Questions, TriviaQA, NewsQA, BioASQ and DuoRC (Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Joshi et al., 2017; Trischler et al., 2017; Tsatsaronis et al., 2015; Saha et al., 2018). Experimental results show that our approach substantially improves over previous unsupervised QA models even when trained on substantially fewer synthetic QA examples.
Our contributions can be summarized as follows:
1. We propose a novel unsupervised QG approach that employs summary data and syntactic/semantic analysis, which to the best of our knowledge is the first work connecting text summarization and question generation in this way;
2. We employ our QG model to generate synthetic QA data, achieving state-of-the-art performance even at low volumes of synthetic training data.

Related Work
Question Generation. Traditional approaches to QG mostly employ linguistic templates and rules to transform declarative sentences into interrogatives Smith, 2009). Recently, Dhole and Manning (2020) showed that, with the help of advanced neural syntactic parsers, template-based methods are capable of generating high-quality questions from texts. Neural seq2seq generation models have additionally been widely employed in QG, with QG data usually borrowed from existing QA datasets (Du et al., 2017; Sun et al., 2018). Furthermore, reinforcement learning has been employed by Zhang and Bansal (2019), Chen et al. (2019) and Xie et al. (2020) to directly optimize discrete evaluation metrics such as BLEU (Papineni et al., 2002). Lewis et al. (2020) and Song et al. (2019) show that a large-scale pre-trained model can achieve state-of-the-art performance for supervised QG (Dong et al., 2019; Narayan et al., 2020).

BLEU (Papineni et al., 2002), ROUGE (Lin, 2004) and Meteor (Banerjee and Lavie, 2005) metrics are commonly borrowed from text generation tasks to evaluate QG. Even with respect to the original text generation tasks, however, the use of such metrics has been questioned (Callison-Burch et al., 2006; Reiter, 2018). Such metrics are particularly problematic for QG evaluation since multiple plausible questions exist for a given passage and answer. Consequently, there has been a shift in focus to evaluating QG extrinsically, by generating synthetic QA pairs and measuring their effectiveness as a data augmentation or unsupervised QA approach (Puri et al., 2020; Shakeri et al., 2020).

Unsupervised QA. In unsupervised QA, the QA model is trained using synthetic data produced by a QG model instead of an existing QA dataset. To avoid relying on existing QA datasets, unsupervised QG methods have been employed, such as Unsupervised Neural Machine Translation (Lewis et al., 2019). Fabbri et al. (2020) and  propose template/rule-based methods for generating questions and employ retrieved paragraphs and cited passages as source passages to alleviate the problem of lexical similarity between passages and questions. Puri et al. (2020) and Shakeri et al. (2020) additionally employ existing QA datasets to train a QG model. Although related, this work falls outside the scope of unsupervised QA.

Methodology
Diverging from supervised neural question generation models trained on existing QA datasets, the approach we propose employs synthetic QG data, which we create from summary data using a number of heuristics, to train a QG model. An overview of the proposed method is shown in Figure 2. We then employ the trained QG model to generate synthetic QA data that is further employed to train an unsupervised QA model.

Question Generation
In order to avoid generating trivial questions that are highly similar to corresponding declarative statements, we employ summary data as a bridge connecting the generated question and the original article. The process we employ involves, first, Dependency Parsing (DP) of summary sentences, followed by Named-Entity Recognition (NER) and finally Semantic Role Labeling (SRL). DP is employed as a means of identifying the main verb (root verb), in addition to other constituents such as auxiliaries. NER is then responsible for tagging all entities in the summary sentence to facilitate discovery of the most appropriate question words to generate. The pivotal component of the linguistic analysis is SRL, employed to obtain all semantic frames for the summary sentence. Each frame consists of a verb followed by a set of arguments which correspond to phrases in the sentence. An argument could comprise, for example, an Agent (who initiates the action described by the verb), a Patient (who undergoes the action), and a set of modifier arguments such as a temporal argument ARG-TMP or a locative argument ARG-LOC. Questions are then generated from the arguments according to argument type and NER tags, which means that wh-words can be determined jointly.
Returning to the example in Figure 1: given the SRL analysis [U2's lead singer Bono ARG-0] has [had VERB] [emergency spinal surgery ARG-1] [after suffering an injury while preparing for tour dates ARG-TMP]., the three questions shown in Figure 1 can be generated based on these three arguments.
The pseudocode for our algorithm to generate questions is shown in Algorithm 1. We first obtain all dependency edges and labels (dps), NER tags (ners) and SRL frames (srl_frames) of a summary sentence. We then iterate through all arguments in the frame of the root_verb (the verb whose dependency label is root) and identify appropriate wh-words (wh*) for each argument using the function identify_wh_word according to its argument type and the NER tags of entities in the argument. We follow Dhole and Manning (2020) in using the standard English wh-words associated with the appropriate argument types and NER tags. We then decompose the current main verb into its base form (base_verb) and appropriate auxiliary words (auxs) in the decomp_verb function, before finally inserting the wh-words and the auxiliary verbs in appropriate positions using the wh_move function. As can be seen from Figure 1, a single summary sentence generates multiple questions when its SRL frame has multiple arguments.

Algorithm 1: Question Generation Heuristics
for frame in srl_frames do
    if root_verb equal to verb then
        for arg in frame do
            wh* = identify_wh_word(arg, ners)
            base_verb, auxs = decomp_verb(arg, dps, root_verb)
            Q_arg = wh_move(S, wh*, base_verb, auxs)
            Q_arg = post_edit(Q_arg)
            examples.append(context, Q_arg, arg)
        end
    end
end
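To make the heuristics concrete, the following is a minimal Python sketch of the wh-word selection and wh-movement steps, applied to a hand-written SRL frame. It is an illustration under stated assumptions, not the authors' implementation: the frame representation, the `WH_BY_ROLE` and `WH_BY_NER` mappings, the toy `VERB_TABLE` lookup standing in for verb decomposition, and the simplified signatures of `identify_wh_word`, `decomp_verb` and `wh_move` are all hypothetical, and no parser is called.

```python
from typing import Dict, List, Optional, Tuple

# A frame is a list of (role label, phrase) pairs in sentence order (assumed format).
Frame = List[Tuple[str, str]]

# Hypothetical wh-word mappings; the paper follows Dhole and Manning (2020) here.
WH_BY_ROLE = {"ARG-TMP": "when", "ARG-LOC": "where"}
WH_BY_NER = {"PERSON": "who", "ORG": "who", "GPE": "where", "DATE": "when"}

# Toy stand-in for verb decomposition: inflected verb -> (base form, auxiliary).
VERB_TABLE = {"announced": ("announce", "did"), "had": ("have", "did")}


def identify_wh_word(role: str, ner: Optional[str]) -> str:
    """Pick a wh-word from the argument type first, then from the NER tag."""
    if role in WH_BY_ROLE:
        return WH_BY_ROLE[role]
    return WH_BY_NER.get(ner or "", "what")


def decomp_verb(verb: str, aux: Optional[str]) -> Tuple[str, str]:
    """Return (verb form kept in place, auxiliary moved to the front)."""
    if aux is not None:
        return verb, aux          # e.g. "has had" keeps "had" and fronts "has"
    return VERB_TABLE[verb]       # e.g. "announced" becomes ("announce", "did")


def wh_move(frame: Frame, answer_idx: int, wh: str, aux: Optional[str]) -> str:
    """Replace the answer argument with a wh-word and front it."""
    role, _ = frame[answer_idx]
    verb = next(p for r, p in frame if r == "VERB")
    subject = next((p for r, p in frame if r == "ARG-0"), None)
    rest = [p for i, (r, p) in enumerate(frame)
            if i != answer_idx and r not in ("VERB", "ARG-0")]
    if role == "ARG-0":
        # Subject question: the wh-word takes the subject slot, no inversion.
        words = [wh, aux, verb] + rest
    else:
        kept_verb, fronted_aux = decomp_verb(verb, aux)
        words = [wh, fronted_aux, subject, kept_verb] + rest
    return " ".join(w for w in words if w) + " ?"


def generate_questions(frame: Frame, ners: Dict[str, str],
                       aux: Optional[str] = None) -> List[Tuple[str, str]]:
    """Generate one (question, answer) pair per non-verb argument."""
    pairs = []
    for i, (role, phrase) in enumerate(frame):
        if role == "VERB":
            continue
        wh = identify_wh_word(role, ners.get(phrase))
        pairs.append((wh_move(frame, i, wh, aux), phrase))
    return pairs


frame = [("ARG-0", "Stephen Hawking"), ("VERB", "announced"),
         ("ARG-1", "the party"), ("ARG-TMP", "in the morning")]
for question, answer in generate_questions(frame, {"Stephen Hawking": "PERSON"}):
    print(answer, "->", question)
```

The sketch omits the post-editing step and handles only a single frame with at most one auxiliary; a real system would obtain the frame, NER tags and auxiliaries from the parsers described above.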

Training a Question Generation Model
The summarization data we employ consists of <passage-summary> pairs. Questions are generated from the summaries using the heuristics described in Section 3.1, so that we have <passage-summary> pairs and <summary-question-answer> triples, which we then combine to form <passage-answer-question> triples to train a QG model. We train an end-to-end seq2seq model, rather than deploying a pipeline in which the summary is generated first and the question second, to eliminate the risk of error accumulation in the generation process. By using this QG data to train a neural generation model, we expect the model to learn a combination of summarization and question generation. In other words, such knowledge can be implicitly injected into the neural generation model via our QG data.
To train the question generation model, we concatenate each passage and answer to form a sequence: passage <SEP> answer <SEP>, where <SEP> is a special token used to separate the passage and answer. This sequence is the input and the question is the target output (objective). In our experiments, we use BART (Lewis et al., 2020) for generation, which is optimized by the following negative log likelihood loss function:

$$\mathcal{L} = -\sum_{i=1}^{N} \log P(q_i \mid C, A)$$

where q_i is the i-th token in the question, N is the number of tokens in the question, and C and A are the context and answer, respectively.
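As a small illustration of the input format and the training objective described above, the sketch below builds the passage <SEP> answer <SEP> sequence and evaluates the negative log likelihood for a list of per-token probabilities. The function names `build_qg_input` and `question_nll` are hypothetical, and the probabilities are supplied by hand rather than produced by BART.

```python
import math
from typing import Sequence

SEP = "<SEP>"  # separator token, written literally as in the text


def build_qg_input(passage: str, answer: str, sep: str = SEP) -> str:
    """Concatenate passage and answer as: passage <SEP> answer <SEP>."""
    return f"{passage} {sep} {answer} {sep}"


def question_nll(token_probs: Sequence[float]) -> float:
    """Negative log likelihood of a question.

    token_probs[i] is the model probability assigned to the i-th gold
    question token given the context and answer.
    """
    return -sum(math.log(p) for p in token_probs)


source = build_qg_input("Bono has had emergency spinal surgery.", "Bono")
print(source)
print(question_nll([0.5, 0.25]))  # equals log(8)
```

In an actual training run the per-token probabilities come from the decoder's softmax output, and the loss is averaged over a batch; both details are left out here.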

Experiments
We test our idea of using summaries in question generation by applying the questions generated by our QG system in unsupervised QA. We describe the details of our experiment setup, followed by our unsupervised QA results on six English benchmark extractive QA datasets.

 et al., 2017) to obtain dependency trees, named entities and semantic role labels for summary sentences, before further employing this knowledge to generate questions from summaries following the algorithm described in Section 3.1. We remove any generated <passage-answer-question> triples that meet one or more of the following three conditions:
1. Articles longer than 480 tokens (exceeding the maximum BART input length);
2. Triples in which fewer than 55% of the tokens in the answer span are also present in the passage (to ensure sufficient lexical overlap between the answer and passage);
3. Questions shorter than 5 tokens (very short questions are likely to have removed too much information).
For the dataset in question, this process resulted in a total of 14,830 <passage-answer-question> triples.
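The three filtering conditions can be sketched as a single predicate over a triple. This is a minimal illustration, assuming whitespace tokenization and case-insensitive token matching (the paper's token counts presumably use the model tokenizer); the `Triple` type and the function names are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    passage: str
    answer: str
    question: str


def answer_overlap(answer: str, passage: str) -> float:
    """Fraction of answer tokens that also occur in the passage."""
    passage_tokens = set(passage.lower().split())
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    hits = sum(tok in passage_tokens for tok in answer_tokens)
    return hits / len(answer_tokens)


def keep_triple(t: Triple, max_passage_tokens: int = 480,
                min_overlap: float = 0.55, min_question_tokens: int = 5) -> bool:
    """Return True if the triple survives all three filters."""
    if len(t.passage.split()) > max_passage_tokens:
        return False  # condition 1: article too long
    if answer_overlap(t.answer, t.passage) < min_overlap:
        return False  # condition 2: answer barely appears in the passage
    if len(t.question.split()) < min_question_tokens:
        return False  # condition 3: question too short
    return True


example = Triple("Bono had emergency spinal surgery in Munich",
                 "emergency spinal surgery",
                 "what did bono have in munich ?")
print(keep_triple(example))
```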
For training the QG model, we employ implementations of BART (Lewis et al., 2020) from Huggingface (Wolf et al., 2019). The QG model we employ is BART-base. We train the QG model on the QG data for 3 epochs with a learning rate of 3 × 10^-5, using the AdamW optimizer (Loshchilov and Hutter, 2019).
Unsupervised QA Training Details. To generate synthetic QA training data, we make use of Wikidumps by first removing all HTML tags and reference links, then extracting paragraphs that are longer than 500 characters, resulting in 60k paragraphs sampled from all paragraphs of Wikidumps. We employ the NER toolkits of Spacy (Honnibal et al., 2020) and AllenNLP (Gardner et al., 2017) to extract entity mentions in the paragraphs. We then remove <paragraph, answer> pairs that meet one or more of the following three conditions: 1) paragraphs with fewer than 20 words or more than 480 words; 2) paragraphs with no extracted answer, or where the extracted answer is not in the paragraph due to text tokenization; 3) answers consisting of a single pronoun.
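The pair-level filtering above can be sketched as follows. This is an illustration only: word counts use whitespace splitting, the answer check is a plain substring test, and the `PRONOUNS` set is a hypothetical, non-exhaustive list.

```python
# Hypothetical, non-exhaustive pronoun list used for condition 3.
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "him", "her", "them", "us", "me", "this", "that"}


def keep_pair(paragraph: str, answer: str,
              min_words: int = 20, max_words: int = 480) -> bool:
    """Return True if the <paragraph, answer> pair survives all three filters."""
    n_words = len(paragraph.split())
    if n_words < min_words or n_words > max_words:
        return False  # condition 1: paragraph too short or too long
    if not answer or answer not in paragraph:
        return False  # condition 2: no answer, or answer not found verbatim
    if answer.lower() in PRONOUNS:
        return False  # condition 3: answer is a single pronoun
    return True


paragraph = " ".join(["word"] * 19 + ["Paris"])
print(keep_pair(paragraph, "Paris"))
```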
Paragraphs and answers are concatenated to form sequences of the form passage <SEP> answer <SEP>, before being fed into the trained BART-QG model to obtain corresponding questions. This results in 20k synthetic QA pairs, which are then employed to train an unsupervised QA model.
The QA model we employ is BERT-large-whole-word-masking (which we henceforth refer to as BERT-large for ease of reference). Document length and stride length are 364 and 128 respectively, and the learning rate is set to 1 × 10^-5. Evaluation metrics for unsupervised QA are Exact Match (EM) and F-1 score.
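For readers unfamiliar with the two metrics, the sketch below computes Exact Match and token-level F-1 for a single prediction. It assumes the answer normalization commonly used for SQuAD-style evaluation (lowercasing, removing punctuation and English articles); the text does not state which evaluation script was used, so treat the details as assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The spring of 1998", "spring of 1998."))
print(f1_score("March 2008", "in March 2008"))
```

Dataset-level scores are obtained by taking, for each question, the maximum over all gold answers and then averaging over questions.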

Results
We use the 20k generated synthetic QA pairs to train a BERT QA model and first validate its performance on the development sets of three benchmark QA datasets based on Wikipedia: SQuAD1.1, Natural Questions and TriviaQA. The results of our method are shown in Tables 1 and 2. The unsupervised baselines we compare with are as follows:
1. Lewis et al. (2019) employ unsupervised neural machine translation (Artetxe et al., 2018) to train a QG model; 4M synthetic QA examples were generated to train a QA model;
2.  employ dependency trees to generate questions and employ cited documents as passages.
For comparison, we also show the results of some supervised models fine-tuned on the corresponding training sets in Table 1. The results of all baseline models are taken directly from published work. As can be seen from the results in Table 1, our proposed method outperforms all unsupervised baselines, and even exceeds the performance of one supervised model, Match-LSTM (Wang and Jiang).
Results for Natural Questions and TriviaQA are shown in Table 2. The results of all baseline models were produced using the released synthetic QA data to finetune a BERT-large model. Our method outperforms previous state-of-the-art unsupervised methods by a substantial margin, obtaining relative improvements over the best unsupervised baseline model of 47% EM and 10% F-1 on Natural Questions, and 34% EM and 12% F-1 on TriviaQA.
In summary, our method achieves the best performance (both in terms of EM and F-1) out of three unsupervised models on all three tested datasets. Furthermore, this high performance is possible with as few as 20k training examples. Compared to previous work, this is less than 10% of the training data employed.
Transferability of Our Generated Synthetic QA Data. We also validate our method's efficacy on three out-of-domain QA datasets: NewsQA created from news articles, BioASQ created from biomedical articles, and DuoRC created from movie plots, for the purpose of evaluating the transferability of the Wikipedia-based synthetic data. Results in Table 3 show that our proposed method also outperforms the unsupervised baseline models on the out-of-domain datasets, achieving F-1 improvements over previous state-of-the-art methods of 3.8, 4.5 and 5.4 points respectively. It is worth noting that our data adapts very well to DuoRC, created from movie plots where the narrative style is expected to require more complex reasoning. Experimental results additionally indicate that our generated synthetic data transfers well to domains distinct from that of the original summary data.

Effect of Answer Extraction
In the unsupervised QA experiments, we extracted answers from Wikipedia passages before feeding them into our QG model to obtain questions. These <passage, answer, question> triples constitute the synthetic data employed to train the QA model. Additionally, we wish to consider what might happen if we instead employ passages and answers taken directly from the QA training data. Doing this would mean that the QA system is no longer considered unsupervised, but we carry out this experiment in order to provide insight into how much room for improvement there is in our NER-based automatic answer extraction method (described in Section 4.1.2). For example, there could well be a gap between the NER-extracted answers and human-extracted answers: the NER could extract answers that are not worth asking about, or miss answers that are highly likely to be asked about. Results of the two additional settings are shown in Table 5: answer extraction has quite a large effect on the quality of the generated synthetic QA data. When we employ the answers from the training set, the performance of the QA model is improved by 5 F-1 points for SQuAD1.1, and over 10 F-1 points for Natural Questions and TriviaQA.

2009 factual error what has been described as a "giant fish" ?
Darwin mismatch

Table 5: Comparison between synthetic data generated based on Wikipedia and synthetic data generated based on the corresponding training set. † are results of the QA model finetuned on synthetic data generated based on NER-extracted answers; ‡ are results of the QA model finetuned on synthetic data based on the answers in the training sets of SQuAD1.1, NewsQA, NQ and TriviaQA.

Effect of Different Heuristics
We additionally investigate the effect of a range of alternate heuristics employed in the process of constructing the QG training data described in Section 3.1. Recall that the QG data is employed to train a question generator which is then employed to generate synthetic QA data for unsupervised QA.
The heuristics are defined as follows:
• Naive-QG only employs summary sentences as passages (instead of the original articles) and generates trivial questions in which only the answer spans are replaced with the appropriate question words. For example, for the sentence "Stephen Hawking announced the party in the morning", with "the party" as the answer span, the question generated by Naive-QG would be "Stephen Hawking announced what in the morning?" We employ the summary sentences as input and questions as target output to form the QG training data.
• Summary-QG makes use of the original news articles of the summaries as passages rather than summary sentences to avoid high lexical overlap between the passage and question. Summary-QG can work with the following heuristics:

We employ the QG data generated by these heuristics to train QG models, which leads to six BART-QG models. We then employ these six models to further generate synthetic QA data based on the same Wikipedia data and compare their performances on the SQuAD1.1 dev set. The results in Table 6 show that using articles as passages to avoid lexical overlap with their summary-generated questions greatly improves QA performance. Summary-QG outperforms Naive-QG by roughly 20 EM points and 16 F-1 points. The results for the other heuristics show that they progressively improve the performance, especially Wh-Movement and Decomp-Verb, which make the questions in the QG data more similar to the questions in the QA dataset.
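The Naive-QG baseline described above amounts to an in-place substitution of the answer span. The following sketch illustrates it; the function name `naive_qg` is hypothetical, and the wh-word is passed in by hand rather than chosen from NER tags.

```python
def naive_qg(sentence: str, answer: str, wh_word: str) -> str:
    """Replace the first occurrence of the answer span with a wh-word."""
    if answer not in sentence:
        raise ValueError("answer span must occur in the sentence")
    body = sentence.replace(answer, wh_word, 1).rstrip(". ")
    return body + "?"


print(naive_qg("Stephen Hawking announced the party in the morning",
               "the party", "what"))
```

Because no constituent is moved, the question shares every other token with its source sentence, which is the lexical overlap that Summary-QG is designed to avoid.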

Effect of the Size of Synthetic QA Data
We investigate the effects of varying the quantity of synthetic QA data. Results in Figure 3 show that our synthetic data allows the QA model to achieve competitive performance even with fewer than 20k examples, which suggests that our synthetic data contains sufficient QA knowledge to enable models to correctly answer a question with less synthetic data compared to previous unsupervised methods. The data-efficiency of our approach increases the feasibility of training a QA system for a target domain where there is no labeled QA data available.

Few-shot Learning
We conduct experiments in a few-shot learning setting, in which we employ a limited number of labeled QA examples from the training set. We take the model trained with our synthetic QA data, the model trained with the synthetic QA data of  and a vanilla BERT model, with all QA models employing BERT-large (Devlin et al., 2019). We train these models using progressively increasing amounts of labeled QA samples from Natural Questions (NQ) and SQuAD1.1 and assess their performance on corresponding dev sets. Results are shown in Figure 4 where with only a small amount of labeled data (less than 5,000 examples), our method outperforms  and BERT-large, clearly demonstrating the efficacy of our approach in a few-shot learning setting.

QG Error Analysis
Despite substantial improvements over baselines, our proposed approach inevitably still incurs errors, and we therefore take a closer look at the questions generated by our QG model. We manually examine 50 randomly selected questions, 31 (62%) of which were deemed high-quality questions. The remaining 19 contain various errors, with some questions containing more than one error: mismatched wh-word and answer (12%), missing information needed to locate the answer (8%), factual errors (10%) and grammatical errors (16%). Typical examples are shown in Table 4.

Conclusion
We propose an unsupervised question generation method which uses summarization data to 1) minimize the lexical overlap between passage and question and 2) provide a QA-dataset-independent way of generating questions. Our unsupervised QA extrinsic evaluation on SQuAD1.1, NQ and TriviaQA using synthetic QA data generated by our method shows that our method substantially outperforms previous methods for generating synthetic QA for unsupervised QA. Furthermore, our synthetic QA data transfers well to the out-of-domain datasets. Future work includes refining our question generation heuristics and applying our approach to other languages.

We also study the effect of the beam size used when generating synthetic questions on the performance of the downstream QA task. Experiments are conducted on the SQuAD1.1 dev set using BERT-large; questions in the synthetic QA data are generated with different beam sizes using the same BART-QG model. The experimental results in Figure 5 show that the beam size is an important factor affecting the performance of unsupervised QA: the largest margin between the highest score (beam-15) and the lowest score (beam-1) in Figure 5 is close to 4 points on EM and F-1 score.

We show the distribution of question types of the QG data described in Section 4.1.1, the training set of SQuAD1.1 and our synthetic QA data described in Section 4.1.2 in Figure 6. Question types are defined as What, When, Where, Who, Why and How. The QG data has more what, when and where questions, indicating the existence of more SRL arguments associated with such question types in the summary sentences.

A.3 Generated QA Examples
Some Wikipedia-based <passage, answer, question> examples generated by our BART-QG model are shown in Table 7, Table 8 and Table 9.

Answer: Elias Boudinot
Question: who has been working for a company that made coins for the us mint ?

Passage: In March 2008 as part of the annual budget, the government introduced several laws to amend the Immigration and Refugee Protection Act. The changes would have helped to streamline immigrant application back-up, to speed up application for skilled workers and to rapidly reject other ones that are judged not admissible by immigration officers. Immigrant applications had risen to a high of 500,000, creating a delay of up to six months for an application to be processed.
Answer: March 2008
Question: when did the uk introduce new immigration laws ?

Passage: The other group members as far back as 1996 had noticed Paddy Clancy's unusual mood swings. In the spring of 1998 the cause was finally detected; Paddy had a brain tumor as well as lung cancer. His wife waited to tell him about the lung cancer, so as not to discourage him when he had a brain operation.
Answer: the spring of 1998
Question: in what time was paddy diagnosed with lung cancer ?

Passage: In 1365 officials were created to supervise the fish market in the town, whilst illegal fishing and oyster cultivation was targeted by the bailiffs in an edict from 1382, which prohibited the forestalling of fish by blocking the river, the dredging of oysters out of season and the obstructing of the river. Colchester artisans included clockmakers, who maintained clocks in church towers across north Essex and Suffolk.
Answer: north Essex
Question: where were hundreds of clocks made by local artisans ?

Passage: Badge numbers for Sheriffs and Deputies consist of a prefix number, which represents the county number, followed by a one to three digit number, which represents the Sheriff's or Deputy's number within that specific office. The Sheriff's badge number in each county is always #1. So the Sheriff from Bremer County would have an ID number of 9-1 (9 is the county number for Bremer County and 1 is the number for the Sheriff).
Answer: The Sheriff's badge number
Question: what is the number used to identify the sheriff in each county ?

Passage: Appian wrote that Calpurnius Piso was sent as a commander to Hispania because there were revolts. The following year Servius Galba was sent without soldiers because the Romans were busy with Cimbrian War and a slave rebellion in Sicily (the [Third Servile War], 104-100 BC). In the former war the Germanic tribes of the Cimbri and the Teutones migrated around Europe and invaded territories of allies of Rome, particularly in southern France, and routed the Romans in several battles until their final defeat.
Answer: Calpurnius Piso
Question: who was sent to the south of italy to fight for the roman empire ?

Passage: The parish churches of Sempringham, Birthorpe, Billingborough, and Kirkby were already appropriated. Yet in 1247, Pope Innocent IV granted to the master the right to appropriate the church of Horbling, because there were 200 women in the priory who often lacked the necessaries of life. The legal expenses of the order at the papal curia perhaps accounted for their poverty.
Answer: 200
Question: there were how many women in the priory of horbling in the 12th century ?

Passage: "Jerry West is the reason I came to the Lakers", O'Neal later said. Finnish popular music also includes various kinds of dance music; tango, a style of Argentine music, is also popular. One of the most productive composers of popular music was Toivo Kärki, and the most famous singer Olavi Virta. Among the lyricists, Sauvo Puhtila, Reino Helismaa (died 1965) and Veikko "Vexi" Salmi are a few of the most notable writers. The composer and bandleader Jimi Tenor is well known for his brand of retro-funk music.
Answer: Reino Helismaa
Question: who has been hailed as one of finland 's most important writers ?