Consecutive Question Generation via Dynamic Multitask Learning

In this paper, we propose the task of consecutive question generation (CQG), which generates a set of logically related question-answer pairs to understand a whole passage, with comprehensive consideration of accuracy, coverage, and informativeness. To achieve this, we first examine the four key elements of CQG, i.e., question, answer, rationale, and context history, and propose a novel dynamic multitask framework with one main task generating a question-answer pair and four auxiliary tasks generating the other elements. The framework directly helps the model generate good questions through both joint training and self-reranking. At the same time, to fully explore the worth-asking information in a given passage, we make use of the reranking losses to sample the rationales and search for the best question series globally. Finally, we evaluate our strategy via QA data augmentation and manual evaluation, as well as a novel application of generated question-answer pairs to DocNLI. We show that our strategy can improve question generation significantly and benefit multiple related NLP tasks.


Introduction
Question Generation (QG) is an important and promising task in natural language generation (NLG). It has long served as an effective way to improve other NLP tasks. The applications of synthetic questions have expanded from QA data augmentation (Duan et al., 2017; Lewis et al., 2021) to building tutoring or dialogue systems (Lindberg et al., 2013; Bordes and Weston, 2017), self-assessing the ability of language models (Sun et al., 2019), and checking the faithfulness of abstractive summaries (Durmus et al., 2020), etc.
Traditionally, syntax-based methods such as semantic parsing are commonly adopted to synthesize questions (Berant et al., 2013; Khullar et al., 2018). Recently, transformer-based pre-trained language models (Vaswani et al., 2017; Devlin et al., 2019) have been widely used to generate questions. Most of these works are two-step QG methods (Sun et al., 2018; Rennie et al., 2020), which rely on ground-truth or pre-extracted answers (Wang et al., 2019; Jia et al., 2020) and generate questions independently (Puri et al., 2020; Bartolo et al., 2021). However, in real scenarios such as daily conversations or reading comprehension, we usually raise several questions consecutively to understand a whole story. Current QG methods are inadequate for generating such questions, as Table 1 shows: there are no logical connections between the questions (e.g., Q3 and Q1), and pre-extracted answers also lead to simplicity (e.g., Q1) and inconsistency (e.g., Q3).

Table 1: Example QG results using a two-step inconsecutive method based on extractive answers.
In such scenarios, we propose the task of consecutive question generation (CQG), which automatically produces a set of well-ordered and logically related question-answer (Q-A) pairs to help understand a given passage (or story). Table 2 shows several "ideal" questions, which are mutually connected and cover diverse information in the text. To achieve this, unlike traditional QG methods, which mainly focus on "what are good questions given an answer", our CQG also requires a model to automatically find "which information in a text is worth asking about". Additionally, since we pose questions not only to get separate pieces of information but to understand a whole story, we propose three key qualities for evaluating consecutive questions simultaneously: accuracy, coverage, and informativeness.
With these demands, we propose an integrated dynamic multitask framework with five unified Seq2Seq generation tasks. One main task generates Q-A pairs, and four auxiliary tasks make full use of the generation of the four key CQG elements (i.e., question, answer, rationale, and context history). We link the qualities of the key aspects with the inference losses of the four auxiliary tasks respectively. Based on this, we then design four distinct methods to improve model performance from all aspects and at all stages of training and inference.
The five tasks are jointly trained in one model to help it learn from different views. In inference, the main task generates candidates and the auxiliary tasks then self-rerank them, improving Q-A accuracy, coverage, and informativeness all around. To fully exploit the worth-asking information in each sentence and generate questions properly and dynamically, we propose a novel rationale sampling method and sentence-level beam-search. We recompose the context history reranking losses to measure the information in each rationale, and then design a sampling probability to guarantee that the more information a rationale still holds, the more likely it is to be asked about once again. To relieve error cascades and guide the direction of a Q-A flow, we lift beam-search to the sentence level, rearranging the total reranking results and seeking the globally optimal Q-A series for a whole passage.
Finally, we conduct abundant experiments to augment various QA datasets, using only the model trained on CoQA. We also perform a manual evaluation and propose a novel zero-shot method for the document-level NLI task (Yin et al., 2021) using question generation. We successfully promote the performance in multiple QA scenarios and demonstrate the extensibility of our model across different NLP tasks.

Related Work
Question generation is a promising task which has been well studied in prior research. Initially, rule-based or traditional machine learning methods were widely used to produce questions. Heilman and Smith (2010) adopt verb transformations, and Berant et al. (2013) use semantic parsing to synthesize questions. Recently, deep learning techniques have further advanced question generation. Du et al. (2017) use an LSTM (Hochreiter and Schmidhuber, 1997) model, and Sultan et al. (2020) adopt a RoBERTa (Liu et al., 2019) model to generate questions.
At the same time, strategies like multitask learning and self-training have been applied to improve the quality of generated questions. Zhou et al. (2019) and Ma et al. (2020) employ a multitask structure to generate coherent and fluent questions. Sachan and Xing (2018) and Rennie et al. (2020) adopt a self-training strategy to jointly learn to ask and answer questions. Alberti et al. (2019) use roundtrip consistency to filter out inconsistent results. Shinoda et al. (2021) generate noisy data, and Sultan et al. (2020) employ nucleus sampling (Holtzman et al., 2020) to improve the diversity of questions. However, these works mainly focus on a single quality aspect, and most of them are based on pre-defined answers or original data.
As QG can produce meaningful questions, it has been widely used to promote other NLP tasks. Liu et al. (2020) use a constrained question rewriting method to generate new data for QA tasks. Wang et al. (2020) and Nan et al. (2021) check the faithfulness of summaries through answering generated questions. Pan et al. (2021) generate question-answer pairs and convert them for fact verification. Nevertheless, the works above mainly produce each question independently and ignore the connections between questions.
As for generating a set of questions over a specific passage, Krishna and Iyyer (2019) propose a pipelined system to ask different levels of questions from general to specific. Lee et al. (2020) use a conditional variational autoencoder to generate multiple robust questions for a given paragraph. Similar to us, Chai and Wan (2020) generate sequential and related questions under dual-graph interaction, but use ground-truth answers. To the best of our knowledge, we are the first to consecutively synthesize a series of connected question-answer pairs to understand an entire passage, with comprehensive consideration of accuracy, coverage, and informativeness.

Task
Table 2: An ideal CQG example, where the questions are mutually connected and can cover diverse information to help understand the whole story. It also shows the data composition of our multitask generation framework, as well as the input and output in the n-th generation step. In this example, the output of Task h is stc_1 when n = 1, and stc_1 ∪ stc_2 when n ≥ 2. "∪" means coverage (union set), with no overlap or replication.
In the Training and Inference section below, we use these tasks to compose four related methods that enhance different stages.
We first symbolically define the four key elements used in our work. S denotes the story from which questions are produced; Q_n means the n-th question and A_n is its answer; R_n is the corresponding rationale (always one sentence) based on which Q_n is generated. Since the Q-A pairs are generated dependent on previous questions, C_n denotes the context, which consists of the previous n − 1 Q-A pairs and the story (note that the story is the text content, while the context is the story plus the previous n − 1 Q-A pairs). Table 2 gives an example. Then we define the main task and the four auxiliary tasks using the n-th turn as follows:

Task main: C_n + R_n → Q_n + A_n
Task a: C_n + Q_n → A_n
Task q: C_n + A_n → Q_n
Task r: C_n + Q_n + A_n → R_n
Task h: Q_1 + A_1 + ... + Q_n + A_n → ∪_{i=1}^{n} R_i

In Task main, because we think an extractive answer is usually simple and getting a Q-A pair in two steps is inconsistent, unlike traditional methods we input the context and rationale and output the question and answer simultaneously.
The design of Task a and Task q aims to guarantee that the generated question and answer are accurate: given the question we can get the answer, and given the answer we can get the question. Here, Task a follows the traditional QA form. We do not input the rationale in Task q because the previous Q-A pairs are included in the context, so if A_n is an accurate answer, the model should recognize the connection between the answer and the previous Q-A pairs and restore the question easily.
Moreover, although we input the rationale in Task main, it does not necessarily imply that the question-answer pair is derived from it. So we design Task r (C_n + Q_n + A_n → R_n) to verify that the model indeed uses the information in the input rationale to get the question and answer. Task r helps the model recognize the corresponding rationale, and thereby increases the coverage of a Q-A series, which means more events or more segments are precisely referred to.
Finally, to generate an informative and useful question, whose requested knowledge does not overlap with previous ones, we consider that the more unseen information a Q-A pair includes, the better. We introduce the history of the context as the coverage of all previous rationales, which represents the total background information up to the current Q-A turn. Therewith, we present Task h, whose output is ∪_{i=1}^{n} R_i, and which uses the Q-A pairs to restore the history. "∪" means cover (union), with no overlap or replication, and "+" means append.
Both Task r and Task h use Q-A pairs to restore the context, but they focus on coverage and informativeness differently. Specifically, a part of a story being covered means a question is asked based on it, while an informative question means it is non-trivial, important, and contains no repetitive information. Also, in Task r we input the context, so the model only needs to locate the correct rationale, but in Task h, it has to generate the history entirely from the Q-A pairs. Therefore, in Task h, if the n-th Q-A pair carries more unseen information, the history will be easier to restore than with a Q-A pair carrying repetitive or trivial information.

Training and Inference
Based on the dynamic multitask framework, we jointly train a BART (Lewis et al., 2020) model.
In inference, we use the main task to generate several candidates and self-rerank them using the auxiliary tasks. With the reranking losses, we design a formula to assess the information and automatically sample the rationales. Globally, we beam-search for the best Q-A series at the sentence level.

Joint Training
We randomly shuffle the five kinds of training instances and use a BART model to jointly train the five tasks together. We also train the model to generate a "?" between a question and its answer to split them, and adopt five hand-made prompts (Liu et al., 2021). Table 2 shows an example of our data structure. Given the Seq2Seq model parameterized by θ, an input sequence x = {x_1, ..., x_n} with n tokens, and a label y = {y_1, ..., y_m} with m tokens, the generation probability and loss are as follows:

p_θ(y | x) = ∏_{t=1}^{m} p_θ(y_t | y_{<t}, x),   (1)
loss = − ∑_{t=1}^{m} log p_θ(y_t | y_{<t}, x).   (2)

Through joint training, we train one model to learn from different views and allow every task to benefit the others mutually. We also acquire the ability to do all five tasks in one model.
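To make the data composition concrete, below is a minimal sketch of how the five task instances for the n-th turn could be linearized into Seq2Seq (input, output) pairs. The prompt wordings and helper fields are illustrative assumptions, not our exact templates.

# A sketch of composing the five training instances for the n-th turn
# (1-indexed). The prompt strings are hypothetical placeholders for the
# five hand-made prompts; qa_pairs is a list of (question, answer) tuples
# and rationales[i] is the rationale sentence of the (i+1)-th turn.
def make_instances(story, qa_pairs, rationales, n):
    history = " ".join(f"{q} ? {a}" for q, a in qa_pairs[:n - 1])
    context = f"{history} {story}".strip()             # C_n = previous Q-A pairs + story
    q, a = qa_pairs[n - 1]
    r = rationales[n - 1]
    qa_flow = " ".join(f"{q_} ? {a_}" for q_, a_ in qa_pairs[:n])
    covered = " ".join(dict.fromkeys(rationales[:n]))  # union of rationales, no repetition
    return {
        "main": (f"ask and answer: {context} {r}", f"{q} ? {a}"),
        "a":    (f"answer the question: {context} {q}", a),
        "q":    (f"restore the question: {context} {a}", q),
        "r":    (f"find the rationale: {context} {q} {a}", r),
        "h":    (f"restore the history: {qa_flow}", covered),
    }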

Self-Reranking
During the inference stage, through the main task we can obtain many candidate question-answer pairs using a decoding strategy like nucleus sampling. To select the best result, inspired by Shen et al. (2021), we feed these candidates to the same model to do Task a, q, r, and h, and then rank the candidates using the inference losses of the four auxiliary tasks. In other words, we use one model as both the generator and the ranker. During reranking, the corresponding question and answer of the auxiliary tasks are those generated by Task main. Specifically, we multiply the four losses together as the reranking loss:

loss_rerank = ∏_{i ∈ {a, q, r, h}} loss_i,   (3)

where the subscript i refers to the different tasks. We also design other loss aggregation methods to calculate the reranking losses, as in Appendix B.3, which shows that using ∏ or ∑ is the same in nature.
We consider the candidate with the lowest reranking loss to be the one that excels in accuracy, coverage, and informativeness overall. This is inspired by the idea of evaluating generated text as text generation (Yuan et al., 2021). Through this strategy, we also unify the form of the training and reranking processes and manage to do them in the same model. Figure 1 shows the structure of our multitask joint training and self-reranking.
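As a minimal sketch of this procedure, the snippet below scores each candidate with the four auxiliary tasks and keeps the one with the lowest product of losses (Eq. 3). Here, model_loss stands in for one forward pass of the joint model; it and the prompt templates are assumptions mirroring the sketch above.

from math import prod

# Self-reranking sketch: the same joint model scores each candidate Q-A
# pair on Task a, q, r, and h; the reranking loss is the product of the
# four inference losses (Eq. 3). model_loss(source, target) is an assumed
# helper returning the NLL of `target` given `source`; history_target is
# the union of rationales up to the current turn.
def rerank(candidates, context, rationale, qa_flow_prefix, history_target, model_loss):
    best, best_loss = None, float("inf")
    for q, a in candidates:  # candidates generated by Task main
        flow = f"{qa_flow_prefix} {q} ? {a}".strip()
        loss = prod([
            model_loss(f"answer the question: {context} {q}", a),             # Task a
            model_loss(f"restore the question: {context} {a}", q),            # Task q
            model_loss(f"find the rationale: {context} {q} {a}", rationale),  # Task r
            model_loss(f"restore the history: {flow}", history_target),       # Task h
        ])
        if loss < best_loss:
            best, best_loss = (q, a), loss
    return best, best_loss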

Rationale Sampling
The aforementioned methods are useful for generating one good Q-A pair. Still, how to effectively generate consecutive questions over a passage remains unsettled. By default, we select every rationale as the next sentence after the previous one. However, one rationale does not necessarily correspond to only one question, because a long, informative sentence may be suitable for several Q-A pairs. Hence, we propose the rationale sampling strategy, which introduces a probability that the next rationale stays on the same sentence as the current one, as Figure 2 shows. We use kp as the keeping probability. Intuitively, we let kp be linearly related to the amount of information left in the current rationale. Traditionally, such information is hard to calculate quantitatively. However, recall that we use the loss of Task h to measure the information of a Q-A series; similarly, we design an inference loss to represent the rest information in the current rationale. We want a higher loss to mean that less information of R_n is included in the Q-A series, and more information is still left in R_n.
Naturally, we first separate out the Q-A pairs posed on R_n. Given the current step n, we find n′, the most recent step where R_{n′} ≠ R_n. Then we use the loss of restoring R_n from the previous sentences and the Q-A pairs posed on R_n to represent the rest information in R_n. Given our multitask framework, we use the ready-calculated losses of Task h to approximate this loss, without introducing more computation or complexity.
We denote this approximation by a; in particular, if n is 1, a is loss_h_1. Empirically, we set the slope to 0.2 and bound kp within 0-0.75, giving

kp = min(max(0.2 · a, 0), 0.75).   (4)

The average kp is 0.32 in our experiments, resulting in about 1.3 questions per sentence.
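A minimal sketch of this sampling step, assuming the approximation a has already been computed from the Task h losses:

import random

# Rationale sampling (Eq. 4): keep asking about the current sentence with
# probability kp, otherwise move to the next sentence. rest_info is the
# approximation `a` derived from the Task h losses; slope and bound follow
# the settings above (0.2 and 0-0.75).
def next_rationale_index(num_sentences, cur_idx, rest_info, slope=0.2, bound=0.75):
    kp = min(max(slope * rest_info, 0.0), bound)  # keeping probability
    if random.random() < kp or cur_idx + 1 >= num_sentences:
        return cur_idx      # ask once again about the same sentence
    return cur_idx + 1      # move on to the next sentence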
Besides, we design two other rationale sampling strategies, as in Appendix B.5, which shows that our strategy based on Task h to calculate information performs better than other hand-made probability formulas.

Sentence-Level Beam-Search
Although rationale sampling helps catch more information and improves flexibility, it brings more uncertainty. The mutually dependent generation may also lead to deviation (Li et al., 2021). Thus, it is crucial to guide the flow direction at every step and ensure the quality of the whole series.
Naturally, inspired by traditional (token-level) beam-search, we propose sentence-level beam-search, as Figure 3 shows. Different from traditional beam-search, which generates a token in each search step, we generate a Q-A pair, and we adopt the reranking loss of each Q-A pair in place of the generation probability. Thus, at each step, we maintain the several candidates with the lowest product of all previous reranking losses:

L = ∏_{i=1}^{n} loss_rerank_i,   (5)

where L is the final loss of our sentence-level beam-search method.
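A minimal sketch of the search loop, where expand abstracts one generation-plus-reranking step (an assumed helper, not part of our released interface):

# Sentence-level beam-search: each beam entry is a partial Q-A flow scored
# by the product of its per-turn reranking losses (Eq. 5). expand(flow, k)
# is an assumed helper that runs rationale sampling, generates k candidate
# Q-A pairs with Task main, and returns (qa_pair, rerank_loss) tuples.
def sentence_level_beam_search(expand, n_turns, beam_size=2, n_candidates=4):
    beams = [([], 1.0)]                      # (Q-A flow, accumulated loss L)
    for _ in range(n_turns):
        expanded = []
        for flow, loss in beams:
            for qa, rerank_loss in expand(flow, n_candidates):
                expanded.append((flow + [qa], loss * rerank_loss))
        # keep the flows with the lowest accumulated reranking loss
        beams = sorted(expanded, key=lambda b: b[1])[:beam_size]
    return beams[0]                          # the globally best Q-A flow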
To summarize, the self-reranking, rationale sampling, and sentence-level beam-search methods above are all for inference. Practically, in each generation step, we first use the previous results to do rationale sampling and locate the rationale, then generate several candidates and calculate the current reranking losses, and finally use the total losses to run sentence-level beam-search and keep several Q-A flows for the next step.

Experimental Setup
We employ the CoQA (Reddy et al., 2019) training set as our training data. CoQA is a large-scale dataset for building Conversational Question Answering systems. The questions are conversational, and thus every question after the first depends on the conversation history. The answers are free-form text with corresponding rationales in the story. We expand the rationales to whole sentences and remove the questions with unknown answers. Finally, we get 7,199 stories, and each story has 15 turns of Q-A pairs on average. The training details and experiments are in Appendix A, where we also analyze the effect of joint training.
After training a model θ on CoQA, we evaluate our model by applying its question generation ability to two downstream tasks: data augmentation for QA and document-level NLI. Further, on the synthetic results on CoQA, we analyze accuracy, coverage, and informativeness using human evaluations and a repeat-pose experiment.

Augment QA Data. Data augmentation is one common way to employ generated questions and verify QG models. To augment a QA dataset D, we (1) use θ to synthesize Q-A pairs D′ on the training set of D; (2) train another BART model θ′ on D′ or D + D′ to answer questions; (3) test θ′ on the dev set of D.
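A minimal sketch of this protocol, with the synthesis, training, and evaluation routines passed in as assumed helpers:

# QA data augmentation: synthesize Q-A pairs D' with the CQG model theta,
# train a fresh QA model theta' on D' (or D + D'), and test it on the dev
# set. All three callables are assumed helpers, not library functions.
def augment_and_evaluate(cqg_generate, train_qa, evaluate_f1,
                         theta, train_set, dev_set, merge=True):
    synthetic = [ex for story in train_set for ex in cqg_generate(theta, story)]
    train_data = train_set + synthetic if merge else synthetic  # D + D' or D'
    theta_prime = train_qa(train_data)
    return evaluate_f1(theta_prime, dev_set)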

Results on CoQA
First we test our strategy by augmenting the CoQA dataset. The setting Origin means the model θ′ is trained on the original CoQA training set, and Synth means it is trained with synthetic Q-A pairs. Inspired by Yuan et al. (2021), we additionally use the inference losses to measure the performance. In Synth, we conduct single q, two step, and single m as three baselines, where single q means we use a single Task q model to ask questions based on the original answers, like traditional QG methods. Two step means we first extract an answer, then generate a question on it using the single Task q model. Single m is a Task main model, which generates Q-A pairs.
Joint train is a multitask jointly trained model. Based on the joint train model, we further add the self-reranking method, using all four auxiliary tasks. Then, on this joint train + rerank model, we conduct four ablation studies of the auxiliary tasks.
Under the joint train + rerank model, we also introduce two other conditions, independent and relay. By default, we generate the question series in an automatic way, which means at every step the previous Q-A pairs are those generated in previous steps. In the independent condition, we let the previous Q-A pairs be empty at all steps, which means the model generates every question like the first question; but when training the QA model θ′, we still input the previous Q-A pairs to align the data format with CoQA. In relay, the previous Q-A pairs of every synthetic instance come from the CoQA training set, and the rationale is the ground-truth rationale sentence, which means the model inherits the Q-A flow from CoQA's authentic context.
Finally, still under the joint train + rerank model, we add rationale sampling (RS) and sentence-level beam-search (SBS). Additionally, we merge the original training set with the synthetic data to create the merging setting (D + D′). Note that RS and SBS are not applicable to the independent or relay conditions. Table 3 shows the main results; Table 4 and Table 5 give the results of the ablation studies and the different conditions. The single q and two step models get relatively low scores when merged with the original data, which means they generate relatively simple and low-quality questions. Using our one-step Q-A pair generation, in the merging setting the single m model leads to higher scores even than single q, which is based on the original answers. Joint training and reranking further improve the F1_qa scores by 1.32 points. From the four ablation studies in Table 4, it is easy to see that every auxiliary task filters the results effectively, leading to 0.07 to 0.18 higher F1_qa scores.
As for our consecutive generation strategy, in Table 5, comparing the independent condition with our model, we can see that consecutive generation largely improves the quality of questions, by 2.23 F1_qa points. Moreover, although the relay model based on the original Q-A flow indeed gets better performance, when we add the RS and SBS strategies to get our best model, the F1_qa score is further increased by 1.46 points, finally outperforming relay generation by 0.19 points. This shows that the Q-A series searched by RS and SBS is even more proper than the ground-truth flow.

Results on SQuAD and more data
To check our QG ability on out-of-domain passages, we augment the SQuAD (Rajpurkar et al., 2018) dataset using our best model trained on CoQA. We select the instances without unknown answers and with a story longer than 128 words. Since the questions in SQuAD are independent but also well-organized, we manually add previous Q-A pairs to align with CoQA.
To truly reveal the ability of our model, we employ it to synthesize more questions on a large number of unlabeled passages. We randomly collect 10,000 Wikipedia passages whose lengths are from 100 to 500 words. Then we use our model trained on CoQA to generate questions on them, resulting in about 0.15 million Q-A pairs, which we use to augment both CoQA and SQuAD. Table 6 shows the results. We can see that the Q-A series indeed enhances question answering. It also indicates that even though our model is trained on a different dataset, its synthesized questions still help a QA model gain 0.27 more F1_qa points on SQuAD. With more Wikipedia questions, on both CoQA and SQuAD, we manage to further improve F1_qa by 0.29 and 0.23 points. This shows that our model performs well when transferred to another dataset and can augment QA training sets with large-scale unlabeled data. Finally, we adopt the large-size model to reach 87.90 F1_qa points on CoQA.

Understand a Whole Passage (DocNLI)
To prove that our generated questions can really explore most of the information in an entire passage, we adopt our model for the document-level NLI (DocNLI) task. Models are required to predict the relation (entailment or not) between a document-level premise and a hypothesis.
Traditionally, a model predicts the relation in a sequence classification way. However, given our ability to synthesize consecutive questions to understand a passage, we propose a zero-shot method to predict the relation based on question generation and answering. Since entailment requires the hypothesis to be derivable from the premise, we first generate Q-A pairs given the hypothesis, and then answer these questions based on the premise. If we can get the same answers, we predict entailment. In detail, we (1) use θ to synthesize a series of Q-A pairs on the hypothesis; (2) use θ to answer the questions Q on the premise, obtaining A′; (3) check the overlap (F1_qa) between A and A′. If the F1_qa exceeds a given threshold, we predict entailment.
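A minimal sketch of this zero-shot procedure; averaging the per-question F1 before thresholding is our assumption about how the overlap is aggregated, and the callables are assumed helpers:

# Zero-shot DocNLI: (1) generate Q-A pairs on the hypothesis, (2) answer
# each question on the premise, (3) compare the answers by word-overlap F1
# and predict entailment if the mean score clears the threshold.
def predict_entailment(cqg_generate, answer, f1_score,
                       theta, premise, hypothesis, threshold=60.0):
    qa_pairs = cqg_generate(theta, hypothesis)        # step (1)
    if not qa_pairs:
        return False
    scores = [f1_score(answer(theta, premise, q), a)  # steps (2) and (3)
              for q, a in qa_pairs]
    return 100.0 * sum(scores) / len(scores) >= threshold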
To make sure that the passages are long enough to generate a series of Q-A pairs, we select the instances whose premises and hypotheses are 200 to 1000 words long from the train, dev, and test sets of DocNLI as our evaluation set. This yields 1,677 instances in all, and we generate 15 turns of Q-A per instance on average with rationale sampling. We use 60 points of F1_qa as the entailment threshold. Table 7 shows the results. F1_nli is the harmonic mean of the precision and recall on the classification task. Impressively, using the zero-shot method, our best model surpasses the fine-tuned BERT model by 1.42 points of F1_nli. Among the different QG settings, although the two step model gets very low losses, its F1_nli score is not very high, indicating that it generates relatively simple questions that cannot extract much information. Our one-step model gets a lower F1_nli score initially, but with the joint training and reranking strategy it improves the score by 0.98 points. Moreover, we can see clearly that the RS and SBS strategies improve the result significantly, by 2.10 F1_nli points. They also manage to enlarge the discrimination between entailment and not entailment. This suggests that our consecutive generation strategy really produces question-answer pairs carrying most of the information in a passage, which can help understand the passage effectively.

Analyses
Accuracy and Coverage (Task a, q and r)
Here we conduct two human evaluations to prove that our strategy improves Q-A accuracy and story coverage, which are the effects of Task a, q, and Task r. Since coverage requires the model to ask about more points of a passage, we use the question-rationale consistency (accuracy of rationale) to reflect it. This is because all sentences are asked about at least once, and rationale sampling further guarantees that the rationales are well-distributed, so if the rationales are all precisely questioned, the coverage should be satisfactory as well.
We randomly collect 10% of the stories from the CoQA dev set and use different methods to generate Q-A pairs. We, the authors, then manually measure whether every question is correctly asked and answered, and whether every question-answer pair is derived from its corresponding rationale.

Table 8: Human evaluations of the accuracy of Q-A and rationale, for Ours, -SBS, -Rerank, and -Joint train. We do not ablate RS here because it is not relevant and would make the data unaligned.
Table 8 clearly shows that multitask joint training plus reranking and sentence-level beam-search increase the accuracy of Q-A by 6.52% and of rationale by 5.39%. Thus, we can say that our strategy, mainly due to Task a, q, and Task r, helps generate questions more correctly and locate the rationale more precisely, leading to higher Q-A accuracy and coverage in a series of questions.

Informativeness (Task h)
To evaluate the ability to utilize the information in a rationale, we present the repeat-pose experiment on CoQA. It is adapted from the relay condition and requires the model to pose another question based on the same rationale and same context as the original question. In other words, the model has to "squeeze" more information from the same rationale, so the key is whether Task h can rank the informativeness of each candidate precisely. Table 9 shows the results, which demonstrate that repeat-pose with the self-reranking strategy further improves the F1_qa scores by 0.36 points, indicating that Task h indeed helps select the more informative question-answer pairs.

Conclusion
In this paper, we propose the consecutive question generation task, which synthesizes mutually connected question-answer pairs to fully explore the information in a passage. By constructing a novel multitask framework with one main task and four unified auxiliary tasks, we generate optimal Q-A series using four sub-methods, which help "generate good questions" as well as "find worth-asking information". With extensive experiments, we prove that our model is able to generate high-quality Q-A pairs to understand a whole passage and can benefit various NLP tasks.

Limitations
In this paper, we propose a novel question generation strategy which can benefit multiple NLP scenarios. We summarize two limitations of this work as follows. First, CQG has high requirements for the training data. In this work, we adopt the CoQA corpus, which was originally developed for the conversational QA task. To the best of our knowledge, CoQA is the only existing dataset suitable for our task. Without more datasets for evaluation, we try to improve the performance on SQuAD and DocNLI to a certain degree by generating questions zero-shot or by generating questions on large-scale Wikipedia passages. In the future, we hope to build a CQG-specific corpus and draw more attention to this novel task.
Second, the time cost of our strategy is higher than that of other methods, because we need to train five tasks jointly and rerank with four auxiliary tasks during inference. Specifically, it is about three times more in training and four times more in inference. A detailed analysis is in Appendix B.2. In future work, we will focus on simplifying our strategy and distilling our model. We will also examine whether a small or base model with less training data can match the performance of common models when using our strategy.

A Implementation and Training Details
We use PyTorch to implement our models. We acquire the pre-trained BART model from the Transformers library (Wolf et al., 2020).
During training, we set the batch size to 64 and the learning rate to 1e-5. The maximum input length is 1024. In inference, we use beam-search with beam size 4 to generate answers for QA. Following Sultan et al. (2020), we combine top-k sampling (k=50) with top-p sampling (p=0.95) to generate question-answer pairs. We return 4 candidates at each step and set the sentence-level beam size to 4, which means that in our best model, at every step we select 4 out of 16 candidate Q-A flows. The models we use are base-size.
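A minimal decoding sketch matching these settings with the Transformers library; the checkpoint name and prompt handling are illustrative, not our released model:

from transformers import BartForConditionalGeneration, BartTokenizer

# Decoding settings from above: beam-search (beam size 4) for QA answers,
# and combined top-k (k=50) / top-p (p=0.95) sampling returning 4
# candidates for Q-A generation. "facebook/bart-base" is an illustrative
# base-size checkpoint standing in for our trained model.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    out = model.generate(**inputs, num_beams=4, max_length=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

def generate_qa_candidates(prompt):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    out = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.95,
                         num_return_sequences=4, max_length=64)
    return tokenizer.batch_decode(out, skip_special_tokens=True)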
After training, we evaluate the losses of the five tasks on the CoQA dev set, as well as the F1_qa scores using Task a. Table 10 shows the results with different training settings. We can see that joint training improves the performance on four out of five tasks, suggesting that the different tasks benefit each other effectively. Prompts also enhance the Q-A ability and decrease the losses on three out of five tasks. During reranking, the scales of the different losses are also close to those in Table 10.

B.1 Beam-Search or Nucleus Sampling
As argued by Sultan et al. (2020), nucleus sampling leads to higher diversity and is better than beam-search in QG. To verify this, we train two sets of models on different tasks with full strategies. We adopt beam-search with size 4 and nucleus sampling with top-k (k=50) and top-p (p=0.95). Table 11 shows that nucleus sampling indeed gains better results than beam-search.

B.2 Efficiency Analysis
When training the multitask model, we jointly train five tasks in one model, so the efficiency of our strategy is an inevitable topic. In Figure 4, we demonstrate the training curves of Task a and Task main using the single model and the multitask model. We can clearly see that the convergence speed of the multitask model is not five times slower than that of the single model. In fact, it only takes about three times as many steps in Task a and four times as many in Task main for our multitask model to reach the optimum point compared with the single model. Also, the initial convergence speed in the first few steps of the single model is only about twice that of the joint model. Thus, in training we can say that the five tasks mutually benefit each other. In inference, our multitask model takes about five times as long to generate a question.

B.5 Rationale Sampling Strategies

Besides our Task-h-based formula, we consider two other strategies: a constant keeping probability and a dynamic probability based on sentence length. Using these three rationale sampling methods, we train three sets of models on different tasks with full strategies. The results are in Table 13. We can see that a dynamic probability is more suitable than a constant value. Also, our strategy based on the auxiliary Task h performs better than the one based on sentence length. Specifically, it scores 0.1 points higher on CoQA and DocNLI and almost the same on SQuAD.

C Example Analysis
In this paper, we propose the consecutive question generation strategy, which mainly focuses on the accuracy, coverage, and informativeness of a series of Q-A pairs generated over a whole passage. Here we further analyze the improvements of our model with a specific example. In Table C, using one passage from the CoQA dev set, we present the synthesized questions produced by our model, compared with the two step, single m, and joint train + rerank independent models, and the original data.
From the example, we can see that the original data contains 20 turns of Q-A pairs, but Q-A 17 to Q-A 20 are out of order. Our model generates 15 Q-A pairs, which is 4 turns more than the other models, thanks to the rationale sampling strategy. For instance, QA 14 and QA 15 of our model both focus on the last sentence, yet ask for two totally different pieces of information, which means our strategy really helps explore more details in a passage.
In the two step and single m models, QA 4 is inconsistent and QA 8 is grammatically erroneous, which is not accurate.
Meanwhile, since the single m model does not sample rationales and asks questions sentence by sentence, QA 5 should focus on the "The girls dog ... up ahead" rationale. However, it asks a question still based on the previous sentence, which means that although we input the rationale, the model picks up the wrong information: it asks twice about the fact that the girl was scared to go ahead and misses a question about the dog's behavior. This is why we relate rationale accuracy to coverage and regard Task r as an important task.
In the joint train + rerank independent model, because the questions are generated independently and Task h cannot help, QA 2 asks for the same answer as QA 1, which provides little information, whereas the other models properly ask about the location they travel to. This proves that our mutually connected consecutive question generation is beneficial.
Additionally, the question series of the two step and joint train + rerank independent models lack fluency. The pre-generated answers are often too long to be proper answers, and the independent Q-A pairs are stiff and crude, mainly because of the missing connections. Finally, among the examples, we can say with confidence that our model generates the best question-answer series and explores the passage most appropriately.

[Table C: the full question-answer series generated on the example passage by the original data, the two step, single m, and joint train + rerank independent models, and our model.]

Figure 1: An overview of our dynamic multitask framework during joint training and self-reranking. One main task generates Q-A pairs and four auxiliary tasks generate the other four CQG elements. In training, the five tasks are jointly trained in one model. In inference, the model uses the main task to generate candidates and then uses the auxiliary tasks to self-rerank them. We use the n-th turn of a series of questions as an example and generate 4 candidates in inference, j ∈ {1, 2, 3, 4}.

Figure 2: An example of rationale sampling, in which there is a probability kp that R_{n+1} is the same sentence as R_n.

Figure 3: An overview of the sentence-level beam-search strategy. In this example, at each step the model generates 4 question-answer candidates, and the sentence-level beam size is 2.


Table 3: Results on the CoQA dev set. In Synth, results without and with merging are separated by "/". In the middle are four ablation experiments of auxiliary tasks with BART joint train + rerank. RS: rationale sampling. SBS: sentence-level beam-search.

Table 4: Results of the ablation studies of the four auxiliary tasks, on the CoQA dev set.

Table 7: Results on the DocNLI task. Finetune is a BERT-base model fine-tuned on about 0.8 million other DocNLI instances. When using our zero-shot method, QA results of entailment and not entailment are separated by "/". We use different models for QG, and the QA model is the same as our best model θ.

Table 9: Results of the repeat-pose experiment. Synthetic data are merged with the original training set.

Table 10: Inference losses and F1_qa scores on the CoQA dev set using different training methods.

Table 11: Results using beam-search or nucleus sampling.
Figure 4: The training curves of Task a and Task main using the single model and the multitask model. The optimum points are marked in the figures. Note that our batch size is 64.

Table 12: Results of the joint train + rerank + RS + SBS model on augmenting the CoQA dataset and the DocNLI task, using different loss aggregation methods.