Asking Questions Like Educational Experts: Automatically Generating Question-Answer Pairs on Real-World Examination Data

Generating high-quality question-answer pairs is a hard but meaningful task. Although previous works have achieved strong results on answer-aware question generation, they are difficult to apply in practice in the education field. This paper addresses, for the first time, the question-answer pair generation task on real-world examination data, and proposes a new unified framework on RACE. To capture the important information of the input passage, we first automatically generate (rather than extract) keyphrases, which reduces the task to joint generation of keyphrase-question-answer triplets. Accordingly, we propose a multi-agent communication model to generate and optimize the question and keyphrases iteratively, and then apply the generated question and keyphrases to guide the generation of answers. To establish a solid benchmark, we build our model on a strong generative pre-training model. Experimental results show that our model clearly outperforms the baselines on the question-answer pair generation task. Moreover, we provide a comprehensive analysis of our model, suggesting new directions for this challenging task.


Introduction
Question-answer pair generation (QAG) aims to perform question generation (QG) and answer generation (AG) simultaneously, given only a passage. The generated question-answer (Q-A) pairs can be applied in a number of tasks such as knowledge management (Wagner and Bolloju, 2005), FAQ document generation (Krishna and Iyyer, 2019), and data enhancement for reading comprehension tasks (Tang et al., 2017; Liu et al., 2020a,b). In particular, high-quality Q-A pairs can facilitate instruction and help create educational materials for reading practice and assessment (Heilman, 2011; Jia et al., 2020).
Recently, much work has been devoted to QG, while the QAG task has received less attention. The existing approaches to this task can be roughly grouped into two categories: joint learning methods that treat QG and answer extraction as dual tasks (Collobert et al., 2011; Firat et al., 2016), and pipeline strategies that consider answer extraction and QG as two sequential processes (Cui et al., 2021).
Although some progress has been made, QAG methods still face several challenges. Most of the existing techniques are trained and tested on Web-extracted corpora such as SQuAD (Rajpurkar et al., 2016), MARCO (Nguyen et al., 2016) and NewsQA. Given the biased and unnatural language sources of these datasets, employing such techniques in the educational field is difficult. Moreover, most previous works regard the answer text as a continuous span in the passage and obtain the answer directly through an extractive method, which may not meet the demands of real-world data.
To alleviate the above limitations, we propose to perform QAG on RACE (Lai et al., 2017). RACE is a reading comprehension corpus collected from the English exams of middle and high schools in China. Compared to Web-extracted corpora, RACE has two notable characteristics: a real-world data distribution and generative answers, which raise new challenges for the QAG task. First, the examination-type documents contain more abundant information and more diverse expressions, placing higher requirements on the generation model. Second, the model is required to summarize the whole document rather than extract a continuous span from the input.
In this paper, we propose a new architecture to deal with this real-world QAG task, as illustrated in Figure 1. It consists of three parts: rough keyphrase generation agent, question-keyphrase iterative generation module and answer generation agent. We first generate keyphrases based on the given document, and then optimize the generated question and keyphrases with an iterative generation module. Further, with the generated keyphrases and questions as guidance, the corresponding answer is generated. To cope with the complex expressions of examination texts, we base our model on the generative pre-training model ProphetNet (Qi et al., 2020). We conduct experiments on RACE and achieve satisfactory improvement compared to the baseline models.
Our contributions are summarized as follows:
1) We are the first to perform the QAG task on RACE. The proposed methods can be easily applied to real-world examinations to produce reading comprehension data.
2) We propose a new architecture for question-answer pair joint generation, which obtains an obvious performance gain over the baseline models.
3) We conduct a comprehensive analysis of the new task and our model, establishing a solid benchmark for future research.

Data Analysis
In this section, we take a close look at SQuAD and RACE to explore the challenges of QAG on real-world examination data.

The Questions Are Difficult
To get a basic sense of the question type in RACE and SQuAD, we count the proportion of the leading unigrams and bigrams that start a question for both datasets, and report the results in Figure 2.
From these statistics we can reasonably conclude that questions in RACE are much more difficult than those in SQuAD: 'what' questions (mostly detail-based) play a major role in SQuAD, while RACE is more concerned with 'why' questions (mostly inference-based). Answering or generating inference-based questions is more challenging than handling detail-based questions, since readers need to integrate information and reason over knowledge.
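The leading unigram and bigram statistics described in this section can be computed along the following lines. This is a minimal sketch: the function name `leading_ngram_proportions`, the lowercasing, the whitespace tokenization and the toy questions are all assumptions made for illustration, not details taken from the experiments.

```python
from collections import Counter


def leading_ngram_proportions(questions, n=1):
    """Return the share of questions that start with each leading n-gram."""
    counts = Counter()
    for question in questions:
        # Assumption: lowercase + whitespace tokenization.
        tokens = question.lower().split()
        if len(tokens) >= n:
            counts[" ".join(tokens[:n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ngram: count / total for ngram, count in counts.items()}


# Toy questions invented for illustration (not taken from RACE or SQuAD).
toy_questions = [
    "Why did the author move to the city?",
    "What is the main idea of the passage?",
    "Why does the writer mention the storm?",
    "What can we infer from the last paragraph?",
]
unigram_share = leading_ngram_proportions(toy_questions, n=1)
bigram_share = leading_ngram_proportions(toy_questions, n=2)
```

Running the same function over the questions of each dataset would give the per-dataset distributions that a figure such as Figure 2 summarizes.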

The Answers Are Generated
We also investigate the n-gram matching rate between an answer and its corresponding passage to measure the AG difficulty for both datasets. On SQuAD, the answers are exact sub-spans of the passage, so the n-gram matching ratio is fixed at 100%. On RACE, however, only 68.8% of the unigrams in the answer also appear in the passage, and the matching ratios of bigram and trigram spans are much lower, at 28.9% and 14.4% respectively. This indicates that the conventional strategy of extracting keyphrases is not appropriate for the QAG task on real-world examination texts.
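A matching rate of this kind can be sketched as the fraction of an answer's n-grams that also occur in the passage. The code below is only an illustration under stated assumptions: lowercasing, whitespace tokenization and counting every n-gram occurrence in the answer are our choices, and the toy passage and answer are invented.

```python
def ngrams(tokens, n):
    """List all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matching_rate(answer, passage, n):
    """Fraction of the answer's n-grams that also occur in the passage."""
    answer_ngrams = ngrams(answer.lower().split(), n)
    if not answer_ngrams:
        return 0.0
    passage_ngrams = set(ngrams(passage.lower().split(), n))
    matched = sum(1 for gram in answer_ngrams if gram in passage_ngrams)
    return matched / len(answer_ngrams)


# Invented toy example: the answer paraphrases the passage instead of copying it.
toy_passage = "the boy walked to school because the bus was late"
toy_answer = "the boy missed the bus"
rates = {n: ngram_matching_rate(toy_answer, toy_passage, n) for n in (1, 2, 3)}
```

For an extractive answer that is copied verbatim from the passage, every n-gram matches and the rate is 1.0 for all n, which corresponds to the 100% ratio noted for SQuAD.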

Model Overview
In this paper, we propose a new framework for Q-A joint generation based on a generative pre-training model; the detailed model structure is illustrated in Figure 3. The whole generation process can be split into three steps:
Step 1. Rough keyphrase generation: generate rough keyphrases from the document, which are fed to the question generation process.
Step 2. Iterative question and keyphrase generation: optimize the question and keyphrase iteratively, starting from the initial keyphrase from Step 1.
Step 3. Answer generation: generate answers with the output questions and keyphrases.
To describe our model clearly, we use $p$ to denote the input passage, and $q_i$ and $k_i$ to refer to the question and keyphrase generated in the $i$-th iteration. In particular, $k_1$ denotes the rough keyphrase from Step 1. We let $a$ refer to the generated answer and $m$ to the number of iterative training epochs, so that $q$, $k$ and $a$ are each the output of the corresponding generation step. Throughout the entire training process, the objective is to minimize the negative log-likelihood of the target sequence.
Our models are based on the generative pre-training model ProphetNet (Qi et al., 2020). Drawing lessons from XLNet, ProphetNet proposes an n-stream self-attention method with an n-gram prediction pre-training task. In this formulation, $H^{(k)}$ denotes the main-stream self-attention, which is the same as the Transformer's self-attention; $g_t^{(k+1)}$ and $s_t^{(k+1)}$ denote the hidden states of the first and second predicting streams at time step $t$ from the $(k+1)$-th layer, and are used to predict $y_{t+1}$ and $y_{t+2}$ respectively; $\oplus$ denotes the concatenation operation. ProphetNet achieves great progress in generative tasks and obtains the best result on the QG task (Liu et al., 2021).
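One way to write down the definitions of $q$, $k$ and $a$ and the training objective described in this section is sketched below. It is an illustrative formalization under assumptions, not a verbatim statement of the model: the conditional-probability form, the generic source sequence $x$, the target sequence $y$ and its length $T$ are our notation, and the keyphrase agent is written as conditioning on the previous question $q_{i-1}$ even though it actually receives the QG agent's hidden state.

```latex
\begin{aligned}
k_1 &= \arg\max_{k} P(k \mid p), \\
q_i &= \arg\max_{q} P(q \mid p, k_i), && 1 \le i \le m, \\
k_i &= \arg\max_{k} P(k \mid p, q_{i-1}), && 2 \le i \le m, \\
a   &= \arg\max_{a} P(a \mid p, q_m, k_m), \\
\mathcal{L} &= -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x).
\end{aligned}
```

Here $x$ stands for whichever input the current agent receives (the passage alone, or the passage joined with a keyphrase or question), and $y$ is the corresponding target keyphrase, question or answer.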
It is worth emphasizing that our training framework has nothing to do with the choice of the underlying model, so we can choose either normal Seq2Seq models like LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) or pre-training models like BART (Lewis et al., 2020) to replace ProphetNet.

Two-stage Fine-tuning of Keyphrase Generation
In this paper, we aim to generate multiple Q-A pairs with a given document as the only input. Accordingly, the vital first step is to obtain question-worthy keyphrases that capture the overall important information of the long document. A keyphrase is in effect an approximation of the ground-truth answer, and previous works often use an extraction model to obtain it. However, considering the different answer characteristics discussed in Section 2.2, extractive methods may not work well on RACE. Therefore, as an alternative, we construct a ProphetNet-based keyphrase generation model to capture the key information in the document. There are two reasons why we choose ProphetNet for keyphrase generation. First, ProphetNet is effective, since it has been shown to perform well on several generation tasks. More importantly, ProphetNet is employed consistently in all three stages of our unified model, which ensures the generality and simplicity of our framework.
To further improve the quality of the generated keyphrases, we adopt a two-stage fine-tuning strategy. First, we use SQuAD as data augmentation for the first-stage training. The keyphrase generation model takes the passage as input, and all the reference answers corresponding to the passage are concatenated with a special separator to form the training target. Then, we carefully fine-tune the model on the RACE dataset. Due to the characteristics of RACE, we remove stop words from the reference answers to form several separate key answer phrases, which serve as the training target in the second stage. We denote the generated result as $k_1$, which at inference time is a string consisting of multiple keyphrases.
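The construction of the two training targets can be sketched as follows. This is a hedged illustration: the separator string `[SEP]`, the tiny stop-word list, and the rule that each run of consecutive non-stop words forms one key phrase are all assumptions, since the text only states that answers are joined with a special separator and that stop words are removed.

```python
# Assumptions for illustration: separator token and a deliberately tiny stop-word list.
SEPARATOR = "[SEP]"
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "is", "was", "he", "she", "it", "and"}


def stage_one_target(reference_answers):
    """Stage 1 (SQuAD-style): join all reference answers of a passage."""
    return f" {SEPARATOR} ".join(answer.strip() for answer in reference_answers)


def answer_to_key_phrases(answer):
    """Stage 2 (RACE-style): drop stop words; each run of remaining words is one phrase."""
    phrases, current = [], []
    for token in answer.lower().split():
        word = token.strip(".,;:!?\"'")
        if word and word not in STOP_WORDS:
            current.append(word)
        elif current:
            # A stop word (or bare punctuation) closes the current phrase.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def stage_two_target(reference_answers):
    """Join the key phrases of all reference answers into one target string."""
    phrases = [p for answer in reference_answers for p in answer_to_key_phrases(answer)]
    return f" {SEPARATOR} ".join(phrases)
```

A production setup would more likely rely on a standard stop-word list and the model's own tokenizer; the point here is only the shape of the two targets.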

Question-Keyphrase Iterative Generation
We propose a multi-task framework for iterative generation between question and keyphrase. Question generation is launched first with the generated $k_1$ as assistance, where $k_1$ is split and its keyphrases are fed separately into the question generation model. The generated question is then fed back to the keyphrase generation agent for optimization.

Question Generation Agent
At each step, $k_i$ is transmitted from the keyphrase generation agent to the question generation process to assist the generation of $q_i$. Briefly, we concatenate $p$ and $k_i$ directly with a separator [CLS] as the input of the encoder. Here $f$ is the embedding layer, $v_q^i$ is the output hidden vector of the QG agent's encoder in the $i$-th iteration, and $M_q^i$ is the QG agent in the $i$-th iteration. On the decoder side, the agent feeds $h_q^i$, the last-layer hidden state of the decoder, into the linear output layer to calculate the word probability distribution with a softmax function, where $W$ is the set of learnable parameters.

Keyphrase Generation Agent
The two generation agents in the multi-task framework have a similar structure except for the input layer. For the keyphrase generation agent, $p$ is passed on its own through the embedding layer to obtain the embedding matrix $e_k$. Then $e_k$ is concatenated with $h_q^{i-1}$ to compose the input of the agent's encoder. Here $f$ is the embedding layer, $h_q^{i-1}$ is the final hidden state of the $(i-1)$-th QG agent, and $M_k^i$ refers to the keyphrase generation agent in the $i$-th iteration.

Answer Generation
After the iterative training for $m$ epochs, we generate the final answer with the assistance of the optimized question $q_m$ and keyphrase $k_m$. We connect $k_m$, $q_m$ and $p$ with the separator [CLS] and feed the result into ProphetNet, where $M_a$ refers to the Q-K guided answer generation model.
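The overall control flow of the three steps can be sketched with placeholder agents. Everything named here is hypothetical: `generate_qa_pair`, the two input builders and the four callables stand in for the fine-tuned models, the sketch passes text between agents rather than hidden states, it handles a single keyphrase rather than splitting $k_1$, and it treats the $m$ iterations as a loop at generation time for readability.

```python
from typing import Callable, Tuple

SEP = "[CLS]"  # separator named in the text


def build_qg_input(passage: str, keyphrase: str) -> str:
    """Input of the QG agent: passage and keyphrase joined by the separator."""
    return f"{passage} {SEP} {keyphrase}"


def build_ag_input(keyphrase: str, question: str, passage: str) -> str:
    """Input of the answer agent: keyphrase, question and passage joined by the separator."""
    return f"{keyphrase} {SEP} {question} {SEP} {passage}"


def generate_qa_pair(
    passage: str,
    rough_keyphrase_agent: Callable[[str], str],
    question_agent: Callable[[str], str],
    keyphrase_agent: Callable[[str, str], str],
    answer_agent: Callable[[str], str],
    m: int = 2,
) -> Tuple[str, str, str]:
    """Return (k_m, q_m, a) for one passage using placeholder agents."""
    keyphrase = rough_keyphrase_agent(passage)  # Step 1: rough keyphrase k_1
    question = ""
    for i in range(1, m + 1):  # Step 2: alternate keyphrase and question
        if i > 1:
            keyphrase = keyphrase_agent(passage, question)  # k_i from q_{i-1}
        question = question_agent(build_qg_input(passage, keyphrase))  # q_i
    answer = answer_agent(build_ag_input(keyphrase, question, passage))  # Step 3
    return keyphrase, question, answer
```

Because the agents are passed in as plain callables, the same loop works with any underlying sequence-to-sequence model, which matches the remark that the framework is independent of the choice of backbone.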

Experiment Setting
Our model adopts the transformer-based pre-training model ProphetNet, which contains a 12-layer transformer encoder and a 12-layer n-stream self-attention decoder. All of our agents utilize the built-in vocabulary and the tokenization method of BERT (Devlin et al., 2019). The dimension of the embedding vector is set to 300. The embedding/hidden size is 1024 and the feed-forward filter size is 4096. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of $1 \times 10^{-5}$, and the batch size is set to 10 throughout the entire training procedure. We train our model on 2 RTX 2080Ti GPUs for about three days.
In the two-stage fine-tuning for keyphrase generation, we set the training epochs as 15 and 10 on SQuAD and RACE respectively. In the later iterative training, we set the training epochs as 15 for QG and 10 for keyphrase generation.
We carry out training and inference on the EQG-RACE dataset proposed by Jia et al. (2020). The training, validation and test sets contain 11457, 642 and 609 passages respectively.
We choose BLEU-4, ROUGE and METEOR to evaluate our model's performance.
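For readers unfamiliar with BLEU-4, a single-reference, sentence-level version can be sketched as below. This is an illustration of the metric's definition only: whitespace tokenization, the absence of smoothing and the single reference are assumptions, and the scores reported in our tables are not claimed to come from this code (they are on a 0 to 100 scale, whereas this function returns a value between 0 and 1).

```python
import math
from collections import Counter


def _ngram_counts(tokens, n):
    """Count all contiguous n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu4(candidate, reference):
    """Unsmoothed BLEU-4 of one candidate against one reference (range 0..1)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = _ngram_counts(cand, n)
        ref_counts = _ngram_counts(ref, n)
        # Clipped counts: a candidate n-gram is credited at most as often as it occurs in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / 4)
```

Because generated questions and answers are short, the unsmoothed score is often zero at sentence level, which is one reason corpus-level aggregation or smoothing is commonly used in practice.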

Comparing Models
To the best of our knowledge, our work is the first to perform QAG on RACE. For reference, we list some results of answer-aware models that are quoted from Jia et al. (2020).
Seq2Seq (
ProphetNet keyphrase guided: A ProphetNet model that generates the question and answer independently, with the guidance of the generated rough keyphrases.
ProphetNet with golden phrases: A ProphetNet model that generates questions with the guidance of the golden answer phrases, which are constructed by removing the stop words from the ground-truth answers, as discussed in Section 3.2.
ProphetNet answer-aware: A ProphetNet model to generate questions with the guidance of the ground-truth answers, which can be regarded as the upper bound for QG.

Main Results
The experiment results are shown in Table 1. For answer-aware QG, the RNN Seq2Seq model just gets a 4.75 BLEU-4, and the Transformer's performance is also not satisfactory.
Our model performs close to the previous state-of-the-art answer-guided model AGGCN-QG, achieving 11.55 BLEU-4 and 16.13 METEOR. The answer-agnostic ProphetNet yields 7.20 BLEU-4 on the QG task and 3.78 on the AG task, demonstrating that even a strong pre-training model cannot perform well on this challenging QAG task. Our unified model improves over the basic ProphetNet model by 4.35 points for QG and 3.09 points for AG.
We obtain the best results when the number of iterations $m = 2$; increasing $m$ further brings no obvious improvement. In particular, our question-keyphrase iterative agent brings an obvious performance gain on AG.
When we feed the right answer into ProphetNet (ProphetNet answer-aware), we get a quite high performance with a 20.53 BLEU-4, which indicates that our simple method by concatenating the passage p and answer into the input of ProphetNet is effective. When we replace the answer with separate phrases, the performance of QG slightly drops. We will discuss more on this point in Section 5.3.

Keyphrase Generation: Mixed Data Augmentation or Two-stage Fine-tuning
As discussed in Section 3.2, the SQuAD dataset is applied to enhance our training data for rough keyphrase generation. There are different strategies for exploiting the two datasets, SQuAD and RACE:
RACE only: Only apply RACE data to fine-tune the pre-training model.
SQuAD only: Only apply SQuAD data to fine-tune the pre-training model.
Mixed data augmentation: Merge the data from SQuAD and RACE together and fine-tune the pre-training model on the mixed collection.
Two-stage fine-tuning: Launch a two-stage fine-tuning, first on the larger SQuAD and then on the smaller RACE.
We report the experimental results in Table 2. Fine-tuning on RACE data alone even reduces the resulting score, which may be caused by two reasons. First, the data size of RACE is small. Second, adopting the discontinuous answer phrases as the training target may lead to a loss of semantic information, which is why we introduce SQuAD to enhance our training data. The keyphrases generated by two-stage fine-tuning bring better results than one-stage mixed data. Given that the answers of SQuAD are part of the original passage, training the keyphrase generation model entirely on SQuAD may cause it to degenerate into an extraction model. In contrast, two-stage fine-tuning can take full advantage of the large-scale SQuAD data while avoiding this degeneration.

Multi-task Learning: Shared Encoder or Iterative Training
We conduct experiments on different multi-task learning methods for Q-A pair joint learning.
Shared-encoder architecture: As illustrated in Figure 4(a), encode passage information with a shared encoder and generate the question and answer with two separate decoders.
Question-answer iterative generation: As illustrated in Figure 4(b), capture keyphrases $k$ with a generation agent and iteratively generate the question and answer with the input passage $p$ and $k$.
Question-keyphrase iterative generation: As illustrated in Figure 1, the three-stage generation process we adopt in our final model.

Table 3 (BLEU-4):
Model | Question | Answer
ProphetNet base | 7.20 | 3.78
shared-encoder | 3.06 | 1.29
q-a iterative generation | 10.39 | 5.78
q-k iterative generation | 11.55 | 6.87
We report the BLEU-4 score of the generated Q-A pairs in Table 3. The shared-encoder model obtains only 3.06 BLEU-4 on questions and 1.29 on answers. This demonstrates that the shared-encoder model cannot generate desirable results in our task, which may be related to the structure of ProphetNet. Iterative training yields an obvious performance gain over both the ProphetNet base and the shared-encoder method. In particular, the Q-K iterative method outperforms the Q-A based one in both the QG and AG tasks, because the keyphrases generated in the first stage are relatively rough and should be further optimized.

Key Content: Key Sentences or Key Phrases
Capturing the important content that is worth questioning and answering is vital to our task. For Q-A generation, we can use keyphrases or key sentences to represent the important content of a passage. Key sentences benefit from complete syntactic and semantic information, while keyphrases are more flexible and do not bring in useless information that disturbs the generation process. We conduct the following experiments to investigate this issue.
Keyphrase: Use the rough keyphrase generation agent (discussed in Section 3.2) to obtain the keyphrases from the passage.
Most similar sentence: Select the key sentence that has the highest matching rate with the keyphrases generated from the rough keyphrase generation agent.
Summarized key sentence: Apply the pretrained model for text summarization, BertSum (Liu, 2019), to extract key sentences from the passage.
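The "most similar sentence" variant above can be sketched as a simple overlap search. The details are assumptions made for illustration: the regular-expression sentence splitter, the token pattern, and the definition of the matching rate as the share of keyphrase tokens that appear in a sentence are ours, as is the function name `most_similar_sentence`.

```python
import re


def most_similar_sentence(passage, keyphrases):
    """Return the passage sentence covering the largest share of keyphrase tokens."""
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    key_tokens = {tok for phrase in keyphrases for tok in phrase.lower().split()}

    def matching_rate(sentence):
        tokens = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        return len(key_tokens & tokens) / len(key_tokens) if key_tokens else 0.0

    # max() keeps the earliest sentence when several share the best rate.
    return max(sentences, key=matching_rate) if sentences else ""


# Invented toy passage and keyphrases for illustration.
toy_passage = "Tom missed the bus. He walked to school in the rain. The teacher was kind."
selected = most_similar_sentence(toy_passage, ["walked", "school rain"])
```

The selected sentence then replaces the keyphrases as guidance for generation, which is where some information can be distorted relative to the original phrases.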
The results of these methods can be found in Table 4. Overall, the keyphrase-based model achieves a better result than the ones based on key sentences. In more detail, key sentences extracted by BertSum bring worse performance, which implies that there is indeed a gap between existing text summarization and keyphrase extraction oriented towards Q-A generation. On the other hand, a slight decline arises on all three metrics when we replace the keyphrases with the most similar sentence, probably because some information is distorted by the replacement.

Question/Answer Relevancy: measures whether the generated question/answer is semantically relevant to the input passage.
Answerability: indicates whether the generated question can be answered by the generated answer.
The evaluation results are shown in Table 5. The Spearman correlation coefficients between annotators are high. Both models achieve nearly full marks on fluency and relevancy owing to the strong performance of the pre-training model. In particular, our unified model obtains an obvious improvement on answerability, which demonstrates the effectiveness of our joint learning method.
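Inter-annotator agreement of this kind can be measured with Spearman's rank correlation, sketched here in plain Python. This is an illustration of the statistic only: the function names are ours, ties receive average ranks, and the annotator scores behind Table 5 are not reproduced.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for position in range(i, j + 1):
            ranks[order[position]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one annotator gives constant scores
    return covariance / math.sqrt(var_x * var_y)
```

With ratings on a small discrete scale, ties are frequent, which is why the average-rank form is used here instead of the shortcut formula that assumes distinct ranks.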

Case Study
Further, we make a detailed analysis of the above 100 samples. According to their quality and their relevance to the reference QA, we categorize the generated QA pairs into four levels:
level 1: is of high quality and similar to the reference;
level 2: is of high quality but has a different focus from the reference;
level 3: has grammatical and syntactic errors;
level 4: has a mismatch error between the generated question and answer.
We count the proportions of these four levels and display some corresponding case examples in Table 6.
We find that 81% of our results are of high quality, but most of them (65%) have a different question focus from the reference, as in Case 2. Only 4% of the results have grammatical and syntactic errors. However, our model suffers from information mismatch when complex details co-occur in a short piece of text, which causes about 15% Q-A mismatch errors in the generated results. As shown in Case 3, according to the passage, the students should make "15 frames" for a "1-second film" and "900 frames" for a "1-minute film", while our model confuses the correspondence.

Given the document and answer, Answer-aware Question Generation (AQG) focuses on generating grammatical and answer-related questions, which can help to construct Q-A pairs automatically. Most AQG methods perform sequence-to-sequence generation with neural network models. Song et al. (2018) propose an LSTM-based model with different matching strategies to capture answer-related information in the document. Yao et al. (2018) present a Seq2Seq model deployed in a GAN with a discriminator, which encourages the model to generate more readable and diverse questions of certain types. Sun et al. (2018) and Chen et al. (2019) incorporate lexical features such as POS, NER, and answer position into the encoder to better represent the document. Jia et al. (2020) notice the disadvantages of applying Web-extracted data to the real-world question generation task and construct a feature-enhanced model on RACE, but it regards answers as available and generates questions with the assistance of answer information.

Question-Answer Pair Generation
Existing QAG methods can be grouped into two categories. First, the joint-learning method conducts question generation and answer extraction simultaneously. Sachan and Xing (2018) propose a self-training method for jointly learning to ask questions as well as answer questions. Tang et al. (2017) regard QA and QG as dual tasks and explicitly leverage their probabilistic correlation to guide the training process of both QA and QG. A generative machine has also been designed that encodes the document and generates a question (answer) given an answer (question). Second, the pipeline strategy sequentially generates (or extracts) answers and questions from given documents. Du and Cardie (2018) first identify the question-worthy answer spans in the input passage with an extraction model and then generate the answer-aware question. Golub et al. (2017) propose a two-stage SynNet with an answer tagging model and a question synthesis model for question-answer pair generation. Willis et al. (2019) explore two different approaches, based on a classifier and a generative language model respectively, to extract high-quality answer phrases from the given passage.

Conclusion
In this paper, we address the QAG task and propose a unified framework trained on the educational dataset RACE. We adopt a three-stage generation method with a rough keyphrase generation model, an iterative message-passing module and a question-keyphrase guided answer generation model. Our model achieves performance close to that of the state-of-the-art answer-aware generation model on the QG task, and obtains a large improvement in the answerability of the generated pairs compared to the basic pre-training model. There is significant potential for further improvement on our proposed QAG task, to help people produce reading comprehension data in real-world applications.