Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation

Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distribution as the challenge sets. However, these methods assume that the distribution of a challenge set is known a priori, making them less applicable to unseen challenge sets. In this study, we focus on question-answer pair generation (QAG) to mitigate this problem. While most existing QAG methods aim to improve the quality of synthetic examples, we conjecture that diversity-promoting QAG can mitigate the sparsity of training sets and lead to better robustness. We present a variational QAG model that generates multiple diverse QA pairs from a paragraph. Our experiments show that our method can improve the accuracy of 12 challenge sets, as well as the in-distribution accuracy.


Introduction
Machine reading comprehension, whose goal is to devise systems that can answer questions about given documents, has gained significant attention in the NLP community (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). Such systems usually rely on neural models, which require a substantial number of question-answer (QA) pairs for training. To reduce the considerable manual cost of dataset creation, there has been a resurgence of studies on automatic QA pair generation (QAG), consisting of a pipeline of answer extraction (AE) and question generation (QG), to augment QA datasets (Yang et al., 2017a; Du and Cardie, 2018; Subramanian et al., 2018; Alberti et al., 2019).
For the downstream QA task, most existing studies have evaluated QAG methods using a test set from the same distribution as a training set (Yang et al., 2017a;Zhang and Bansal, 2019;Liu et al., 2020). However, when a QA model is evaluated only on an in-distribution test set, it is difficult to verify that the model is not exploiting unintended biases in a dataset (Geirhos et al., 2020). Exploiting an unintended bias can degrade the robustness of a QA model, which is problematic in real-world applications. For example, recent studies have observed that a QA model does not generalize to other QA datasets (Yogatama et al., 2019;Talmor and Berant, 2019;Sen and Saffari, 2020). Other studies have found a lack of robustness to challenge sets, such as paraphrased questions (Gan and Ng, 2019), questions with low lexical overlap (Sugawara et al., 2018), and questions that include noise (Ravichander et al., 2021).
While existing studies have proposed data augmentation methods targeting a particular challenge set, they are effective only at the expense of the in-distribution accuracy (Gan and Ng, 2019; Ribeiro et al., 2019; Ravichander et al., 2021). These methods assume that the target distribution is given a priori. However, identifying in advance the types of samples that a QA model cannot handle is difficult in real-world applications.
We conjecture that increasing the diversity of a training set with data augmentation, rather than augmenting QA pairs similar to the original training set, can improve the robustness of QA models. Poor diversity in QA datasets has been shown to result in the poor robustness of QA models (Lewis and Fan, 2019;Geva et al., 2019;Ko et al., 2020), supporting our hypothesis. To this end, we propose a variational QAG model (VQAG). We introduce two independent latent random variables into our model to learn the two one-to-many relationships in AE and QG by utilizing neural variational inference (Kingma and Welling, 2013). Incorporating the randomness of these two latent variables enables our model to generate diverse answers and questions separately. We also study the effect of controlling the Kullback-Leibler (KL) term in the variational lower bound for mitigating the posterior collapse issue (Bowman et al., 2016), where the model ignores latent variables and generates outputs that are almost the same. We evaluate our approach on 12 challenge sets that are unseen during training to assess the improved robustness of the QA model.
In summary, our contributions are three-fold:
• We propose a variational question-answer pair generation model with explicit KL control to generate significantly diverse answers and questions.
• We construct synthetic QA datasets using our model to boost the QA performance in an in-distribution test set, achieving comparable scores with existing QAG methods.
• We discover that our method achieves meaningful improvements in unseen challenge sets, which are further boosted using a simple ensemble method.
Related Work

Answer Extraction

AE aims to extract question-worthy phrases, i.e., phrases worth being asked about, from a textual context without looking at the questions. AE has mainly been performed in two ways: with rule-based and with neural methods. Yang et al. (2017a) extracted candidate phrases using rule-based methods such as named entity recognition (NER). However, not all the named entities, noun phrases, verb phrases, adjectives, or clauses in the given documents are used as gold answer spans, so these rule-based methods are likely to extract many trivial phrases. Therefore, there have been studies on training neural models to identify question-worthy phrases. Du and Cardie (2018) framed AE as a sequence labeling task and used BiLSTM-CRF (Huang et al., 2015). Subramanian et al. (2018) treated the positions of answers as a sequence and used a pointer network (Vinyals et al., 2015). Wang et al. (2019) used a pointer network and Match-LSTM (Wang and Jiang, 2016, 2017). Alberti et al. (2019) made use of pretrained BERT (Devlin et al., 2019).
However, these neural AE models are trained with maximum likelihood estimation; that is, each model is optimized to produce an answer set closest to the gold answers. In contrast, our model incorporates a latent random variable and is trained by maximizing the lower bound of the likelihood to extract diverse answers. In this study, we assume that there should be question-worthy phrases that are not used as the gold answers in a manually created dataset. We aim to extract such phrases.

Question Generation
Traditionally, QG was studied using rule-based methods (Mostow and Chen, 2009; Heilman and Smith, 2010; Lindberg et al., 2013; Labutov et al., 2015). After Du et al. (2017) proposed a neural sequence-to-sequence model (Sutskever et al., 2014) for QG, neural models that take a context and an answer as inputs started to be used to improve question quality with attention (Bahdanau et al., 2014) and copying (Gulcehre et al., 2016; Gu et al., 2016) mechanisms. Most works focused on generating relevant questions from context-answer pairs (Zhou et al., 2018; Song et al., 2018; Zhao et al., 2018; Sun et al., 2018; Kim et al., 2019; Liu et al., 2019; Qiu and Xiong, 2019), demonstrating the importance of answers as input features for QG. Other works studied predicting question types (Zhou et al., 2019; Kang et al., 2019), modeling a structured answer-relevant relation (Li et al., 2019), and refining generated questions (Nema et al., 2019). To further improve question quality, policy gradient techniques have been used (Yuan et al., 2017; Yang et al., 2017a; Yao et al., 2018; Kumar et al., 2019), and Dong et al. (2019) used a pretrained language model. The diversity of questions has been tackled with variational attention (Bahuleyan et al., 2018), a conditional variational autoencoder (CVAE) (Yao et al., 2018), and top-p nucleus sampling (Sultan et al., 2020). Our study differs from these in that we study QAG by introducing variational methods into both AE and QG. Lee et al. (2020) is the closest to our study in terms of the modeling choice. While Lee et al. (2020) introduced an information-maximizing term to improve the consistency of QA pairs, our study uniquely controls diversity by explicitly controlling the KL values.
Despite the potential of data augmentation with QAG to mitigate the sparsity of QA datasets and avoid overfitting, little is known about the robustness of QA models reinforced with QAG to more challenging test sets. We comprehensively evaluate QAG methods on challenging QA test sets, such as hard questions (Sugawara et al., 2018), implications (Ribeiro et al., 2019), and paraphrased questions (Gan and Ng, 2019).

Variational Autoencoder
The variational autoencoder (VAE) (Kingma and Welling, 2013) is a deep generative model consisting of a neural encoder (inference model) and decoder (generative model). The encoder learns to map from an observed variable to a latent random variable and the decoder works vice versa. The techniques of VAE have been widely applied to NLP tasks such as text generation (Bowman et al., 2016), machine translation (Zhang et al., 2016), and sequence labeling (Chen et al., 2018).
The CVAE is an extension of the VAE, in which the distribution of a latent variable is explicitly conditioned on certain variables and enables generation processes to be more diverse than a VAE (Li et al., 2018;Zhao et al., 2017b;Shen et al., 2017). The CVAE is trained by maximizing the variational lower bound of the log likelihood.

VQAG: Variational Question-Answer Pair Generation Model

Problem Definition
Our problem is to generate QA pairs from textual contexts. We focus on extractive QA in which an answer is a text span in context. We use c, q, and a to represent the context, question, and answer, respectively. We assume that every QA pair is sampled independently given a context. Thus, the problem is defined as maximizing the conditional log likelihood log p(q, a|c) averaged over all samples in a dataset.
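Under this independence assumption, the objective decomposes into the AE and QG likelihoods that the pipeline models:

```latex
\log p(q, a \mid c) = \log p(a \mid c) + \log p(q \mid a, c)
```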

Variational Lower Bound with Explicit KL Control
Generating questions and answers from different latent spaces makes sense because multiple questions can be created from a context-answer pair and multiple answer spans can be extracted from a context. Thus, we introduce two independent latent random variables to assign the roles of diversifying AE and QG to z and y, respectively. VAEs often suffer from posterior collapse, where the model learns to ignore latent variables and generates outputs that are almost the same (Bowman et al., 2016). Many approaches have been proposed to mitigate this issue, such as weakening the generators (Bowman et al., 2016;Yang et al., 2017b;Semeniuta et al., 2017), or modifying the objective functions (Tolstikhin et al., 2018;Zhao et al., 2017a;Higgins et al., 2017).
To mitigate this problem, we use a variant of the modified β-VAE (Higgins et al., 2017) proposed by Burgess et al. (2018), which uses two hyperparameters to control the KL terms. Our modified objective function is:

$\mathcal{L}(\theta, \phi) = \mathbb{E}\left[\log p_\theta(q, a \mid z, y, c)\right] - \left|D_{\mathrm{KL}}(q_\phi(z \mid a, c) \,\|\, p_\theta(z \mid c)) - C_a\right| - \left|D_{\mathrm{KL}}(q_\phi(y \mid q, a, c) \,\|\, p_\theta(y \mid c)) - C_q\right|, \quad (1)$

where $D_{\mathrm{KL}}$ is the KL divergence, θ (φ) denotes the parameters of the generative (inference) model, and C_a, C_q ≥ 0. See Appendix A for the derivation of the objective. Tuning C_a and C_q was enough to regularize the KL terms in our case (see Appendix B). C_a and C_q can explicitly control the KL values because the KL terms are forced toward these target values during training. We show mathematically that this KL control can be interpreted as controlling the conditional mutual information I(z; a) and I(y; q). This is the major difference between our model and Lee et al. (2020), where I(q; a) is maximized to improve the consistency of QA pairs. See Appendix C for the mathematical interpretation.
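As a concrete sketch, the per-example loss implied by this objective can be written as follows (assuming the γ weight of Burgess et al. (2018) is folded in as 1, with the reconstruction NLLs and KL terms already computed as scalars; the function name is ours):

```python
def vqag_loss(recon_nll_a, recon_nll_q, kl_z, kl_y, c_a, c_q):
    """Burgess-style KL control: instead of pushing each KL term toward
    zero, the absolute-difference penalty pulls it toward its target
    value (c_a for the answer latent z, c_q for the question latent y),
    which keeps the latents informative and mitigates posterior collapse."""
    return (recon_nll_a + recon_nll_q
            + abs(kl_z - c_a)
            + abs(kl_y - c_q))
```

When the KL terms sit exactly at their targets, the penalty vanishes and only the reconstruction terms remain.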

Model Architecture
An overview of VQAG is given in Figure 1. We denote c i , q i , and a i as the i-th word in context, question, and answer, respectively. See Appendix D for the details of the implementation.

Embedding and Contextual Embedding Layer
First, in the embedding layer, the i-th word, w_i, of a sequence of length L is converted into word- and character-level embedding vectors, the latter using a CNN based on Kim (2014), and the two vectors are concatenated. We then pass the embedding vectors to the contextual embedding layer, which consists of bidirectional LSTMs (BiLSTMs). We obtain H ∈ R^{L×2d}, the concatenated outputs of the LSTMs in each direction at each time step, and h ∈ R^{2d}, the concatenated last hidden state vectors in each direction. The superscripts of the outputs H and h shown in Figure 1 indicate where they come from: C, Q, and A denote the context, question, and answer, respectively. Following Zhao et al. (2017b), we hypothesized that the prior and posterior distributions of the latent variables follow multivariate Gaussian distributions with diagonal covariance. The mean μ and log variance log σ² of these prior and posterior distributions of z and y are computed with linear transformations of h^C, h^A, and h^Q. Next, the latent variables z and y are obtained using the reparameterization trick (Kingma and Welling, 2013) and passed to the AE and QG models, respectively. z and y are sampled from the posteriors during training and from the priors during testing.

Answer Extraction Model
We regard AE as two-step autoregressive decoding, i.e., p(a|c) = p(c_start|c) p(c_end|c_start, c), which predicts the start and end positions of an answer span in this order. For AE, we modify a pointer network (Vinyals et al., 2015) to take as input an initial hidden state computed by a linear transformation of z, which diversifies AE by learning the mappings from z to a. We use an LSTM as the decoder and compute attention scores over H^C.
Answer-aware Context Encoder To compute answer-aware context information for QG, we use another BiLSTM. We concatenate H^C with one-hot vectors of the start and end positions of the answer and feed the result to the BiLSTM. We obtain H^{CA} ∈ R^{L×2d}, the concatenated outputs of the LSTMs in each direction. H^{CA} is used as the source for attention and copying in QG.
Question Generation Model For QG, we modify an LSTM decoder with attention and copying mechanisms to take as input an initial hidden state computed by a linear transformation of y, which diversifies QG. At each time step, the probability distribution of generating words from the vocabulary, P_v(q_i), is computed using attention (Bahdanau et al., 2014), and the probability distribution of copying words from the context, P_c(c_j) (Gulcehre et al., 2016; Gu et al., 2016), is also computed using attention. In parallel, the switching probability p_s is linearly estimated from the hidden state vector. Lastly, we compute the probability of q_i as the p_s-weighted mixture of the generation and copy distributions:

$p(q_i) = p_s P_v(q_i) + (1 - p_s) \sum_{j: c_j = q_i} P_c(c_j)$
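A minimal sketch of this generate-or-copy mixture, assuming p_s is the probability of generating from the vocabulary and (1 − p_s) the probability of copying (this convention, the function, and the variable names are ours for illustration):

```python
def final_word_prob(p_s, p_vocab, p_copy, context_words, target_word):
    """Mix the vocabulary distribution p_vocab (dict: word -> prob) and
    the copy distribution p_copy (per-position probs over the context)
    with switch probability p_s. Copy mass for the target word is the
    sum over all context positions holding that word."""
    copy_mass = sum(p for word, p in zip(context_words, p_copy)
                    if word == target_word)
    return p_s * p_vocab.get(target_word, 0.0) + (1.0 - p_s) * copy_mass
```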

Dataset
We used SQuAD v1.1 (Rajpurkar et al., 2016), a large-scale QA dataset consisting of documents collected from Wikipedia and 100k QA pairs created by crowdworkers, as the source dataset for QAG. Answers to questions in SQuAD can be extracted from the textual contexts. Since the SQuAD test set has not been released, we use the SQuAD-Du split (Du et al., 2017), in which the original training set is split into a training set (SQuAD Du train) and a test set (SQuAD Du test), and the original development set is used as the dev set (SQuAD Du dev). The sizes of SQuAD Du train, SQuAD Du dev, and SQuAD Du test are 75,722, 10,570, and 11,877, respectively. See Appendix E for the training details of VQAG.

Answer Extraction
First, we conducted the AE experiment, where the inputs were the contexts and the outputs were sets of multiple answer spans. The objective of this experiment is to measure the diversity of our extracted answers and the extent to which they cover the ground truths. We also study the effect of C_a in Eq. 1.

Result Table 1 shows the result. While we tested various values of C_a ranging from 0 to 100, we report only selected values here for brevity. With C_a larger than 20, the scores did not improve. Our model with C_a = 5 performed best in terms of the recall scores while surpassing the diversity of NER. The highest Dist scores did not occur together with the highest recall scores. When C_a = 0, the Dist score is fairly low, which suggests posterior collapse, though the precision scores are the best. We assert that low precision scores do not necessarily mean poor performance in our experiment, because even the original test set does not cover all the valid answer spans.

Answer-aware Question Generation
We also conducted answer-aware QG experiments where the contexts and ground truth answer spans were the inputs to assess diversity and relevance to the gold questions.

Metrics
To evaluate the diversity of the generated questions, our models generated 50 questions from each context-answer pair. We report the recall scores (denoted as "-R") of BLEU-1 (B1), METEOR (ME), and ROUGE-L (RL) per reference question. We do not report precision scores here because our motivation is to improve diversity. To measure diversity, we report Dist-1 (D1), Entropy-4 (E4) (Serban et al., 2017; Zhang et al., 2018), and Self-BLEU-4 (SB4) (Zhu et al., 2018).

Baselines We compared our models with SemQG (Zhang and Bansal, 2019). We used diverse beam search (Li et al., 2016b), sampled the top 50 questions per answer from SemQG, and used them to calculate the metrics as the baseline for a fair comparison.

Result The results in Table 2 show that our model improves diversity at the cost of the recall scores compared to SemQG. Using C_q larger than 20 did not lead to improved diversity. A more detailed analysis of C_a and C_q is provided in Appendix F.
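For reference, Dist-n (the proportion of distinct n-grams among all generated n-grams) can be computed as in this sketch (whitespace tokenization is an assumption; the exact tokenization used in our evaluation may differ):

```python
def dist_n(sentences, n=1):
    """Dist-n diversity metric: number of distinct n-grams divided by
    the total number of n-grams across a set of generated sentences.
    Higher values indicate more diverse generations."""
    total = 0
    distinct = set()
    for sent in sentences:
        tokens = sent.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0
```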

Synthetic Dataset Construction
We created three synthetic QA datasets, denoted as D 5,5 , D 20,20 , and D 5,20 , using VQAG with the different configurations (C_a, C_q) = (5, 5), (20, 20), and (5, 20), respectively. These configurations were chosen based on the recall-based metrics and diversity scores in the AE and QG results. VQAG generated 50 QA pairs from each paragraph in SQuAD Du train to construct each D. It is generally known that VAEs generate diverse but low-quality data, unlike GANs. We used heuristics to filter out low-quality generated QA pairs: dropping questions longer than 20 words or shorter than 5 words and answers longer than 10 words, keeping only questions that contain at least one interrogative word, and removing questions with n-gram repetition. While some existing works used the BERT QA model or an entailment model as a data filter (Alberti et al., 2019; Zhang and Bansal, 2019; Liu et al., 2020), our heuristics are enough to obtain improvement in the downstream QA task, as shown in §4.6. Some samples in our datasets are given in Table 3, showing that diverse QA pairs are generated. See Appendix G for how VQAG maps latent variables to QA pairs.

Table 3: Heatmap of extracted answer spans and generated samples using our model, over a passage about Beyoncé's vocal range. The darker the color, the more often the word is extracted. The phrases surrounded by black boxes are the ground truth answers in SQuAD. Example generated pairs:
Q: how can one find her vocal abilities in key music ? A: she is identified as the centerpiece of destiny 's child
Q: how many octaves spans beyoncé 's vocal range ? A: spans four
Q: how many octaves 's vocal range spans the beyoncé hop vocal range ? A: four
Q: who commented that her voice is tart yet tart ? A: jon pareles
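The heuristic filter described above might look like the following sketch (the concrete interrogative list and the repetition n-gram size n = 3 are assumptions; whitespace tokenization is assumed):

```python
INTERROGATIVES = {"what", "who", "whom", "whose", "which", "where",
                  "when", "why", "how"}

def keep_pair(question, answer, rep_n=3):
    """Return True if a generated (question, answer) pair passes the
    heuristic filter: 5-20 question words, at most 10 answer words,
    at least one interrogative word, and no repeated n-gram in the
    question."""
    q_toks, a_toks = question.split(), answer.split()
    if not (5 <= len(q_toks) <= 20) or len(a_toks) > 10:
        return False
    if not INTERROGATIVES & set(q_toks):
        return False
    ngrams = [tuple(q_toks[i:i + rep_n])
              for i in range(len(q_toks) - rep_n + 1)]
    return len(ngrams) == len(set(ngrams))
```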

Human Evaluation
We assessed the quality of the synthetic QA pairs by conducting a human evaluation on Amazon Mechanical Turk. We randomly chose 200 samples from the synthetic QA pairs generated by SemQG (Zhang and Bansal, 2019) and by our model with (C_a, C_q) = (5, 5) and (20, 20) from the paragraphs in SQuAD Du test. We also chose 100 samples from SQuAD Du test. In addition to the three items proposed by Liu et al. (2020), we asked annotators whether a given answer is important, i.e., worth being asked about. We showed the workers a (passage, question, answer) triple and asked them to answer the four questions shown in Table 4. See Appendix H for the details. We report the responses aggregated by majority vote.
According to the results in Table 4, nearly 25% of our questions are not understandable or meaningful, and 30% of our answers are incorrect for the generated questions. This indicates that our synthetic datasets contain a considerable number of QA pairs that are noisy in these two respects. However, 90% of the generated questions are relevant to the passages, and 90% of the answers extracted by our models are question-worthy. As we will verify in §4.6, our noisy but diverse synthetic datasets are effective in enhancing QA performance on the in- and out-of-distribution test sets.

Question Answering
We evaluated the QAG methods on the downstream QA task, using 12 challenge sets in addition to the in-distribution test set.

Baselines
We compared our method with the following baselines.
• SQuAD Du train : BERT-base model trained on SQuAD Du train without data augmentation.
• HarQG (Du and Cardie, 2018) uses neural AE and QG models and generates over one million QA pairs from top-ranking Wikipedia articles not included in SQuAD. We used the publicly available dataset.
• SemQG (Zhang and Bansal, 2019) uses reinforcement learning to generate more SQuAD-like questions. We reran the trained model and generated questions from the same context-answer pairs as HarQG.
• InfoHCVAE (Lee et al., 2020) uses a variational QAG model with an information-maximizing term. We trained this model on SQuAD Du train and then generated 50 QA pairs from each context in SQuAD Du train for a fair comparison with VQAG.

Training Details
We first trained pretrained BERT-base models (Devlin et al., 2019) on each synthetic dataset and then fine-tuned them on SQuAD Du train. We adopted this procedure following existing data augmentation approaches for QA (Dhingra et al., 2018; Zhang and Bansal, 2019). In our study, the order in which our synthetic datasets D were given to a QA model was tuned on the dev set.
We used Hugging Face's implementation of BERT (Wolf et al., 2020). We used Adam (Kingma and Ba, 2014) with ε = 1e-8 as the optimizer. The batch size was 32. In both the pretraining and fine-tuning procedures, the learning rate decreased linearly from 3e-5 to zero. We trained for one epoch on a synthetic dataset and two epochs on the original training set.
In addition to the performance of Single models, we reported the performance of Ensemble models, where the output probabilities of three different QA models are simply averaged. In practice, the top 20 candidate answer spans predicted by each QA model were used for the final prediction.
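The averaging step of the ensemble can be sketched as follows (representing each model's top-20 predictions as a dict from (start, end) spans to probabilities is an assumption about the interface; spans missing from a model's top-20 are treated as probability 0):

```python
def ensemble_span_scores(model_probs):
    """Average per-span probabilities from several QA models and return
    the highest-scoring span. model_probs: list of dicts mapping
    (start, end) spans to probabilities."""
    totals = {}
    for probs in model_probs:
        for span, p in probs.items():
            totals[span] = totals.get(span, 0.0) + p
    n = len(model_probs)
    averaged = {span: p / n for span, p in totals.items()}
    return max(averaged, key=averaged.get)
```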

Challenge Sets
We assessed the robustness of the QA models to the following 12 challenge sets, as well as SQuAD Du test .
• NewsQA (

Table 5: QA performance (F1 score) on SQuAD Du test and the 12 challenge sets. The abbreviations of the challenge sets are explained in §4.6. Curly brackets denote an ensemble of different models (e.g., {+VQAG}*3 denotes the ensemble of three QA models trained with different random seeds after data augmentation with VQAG). The best scores for each of the Single and Ensemble models are boldfaced. Scores degraded relative to the no-data-augmentation baseline (the 1st line) are in red. Sem: SemQG, Info: InfoHCVAE, V: VQAG.

tions in questions, adversarial examples, and noise that may occur in real-world applications.

Results
The overall results are given in Table 5. First, we found that the QA model without data augmentation degraded in performance on the 12 challenge sets, showing a lack of robustness to the natural and adversarial distribution shifts in contexts, questions, and answers. With data augmentation using QAG, the in-distribution scores were generally improved, except for HarQG. In the Single model setting on the challenge sets, SemQG achieved the best performance on Quo and Add; InfoHCVAE on News, NQ, Hard, AddO, MT, ASR, and KB; and VQAG on Para, APara, and Imp. These results imply that different QAG methods have different benefits. In the Ensemble setting, taking the best of the three, the scores on SQuAD Du test, News, MT, and ASR were further improved with {+Sem,+Info,+V}.
We also report scores that can be obtained if the challenge set is known (Table 5); that is, when natural or synthetic samples from the same distributions as the challenge sets are available during training. For News, NQ, and Quo, we trained the BERT-base model on the corresponding human-annotated training sets. For paraphrased questions (Para, APara) and NoiseQA (MT, ASR, and KB), the scores were taken from Gan and Ng (2019) and Ravichander et al. (2021), respectively. These scores can be considered upper bounds. (Note that the score on Para, 85.7 F1, is degraded compared to the score on the SQuAD dev set, 87.9 F1, which is the source for creating Para; this demonstrates the lack of robustness to paraphrased questions.) In NoiseQA, the QAG methods consistently improved the scores, even though they were not designed for the noise. This may be because the lack of quality in the synthetic datasets, as shown in Table 4, unintentionally improved the robustness to the noise. However, the most significant performance gap (> 30 F1) between the upper bound and the no-data-augmentation baseline was observed in Quo. This result indicates that a QA model does not acquire coreference resolution from SQuAD, even though approximately 18% of SQuAD questions require coreference resolution (Sugawara et al., 2018). The QAG methods mitigated this gap to some extent, but significant room for improvement remains.
The improvement in NQ is generally more prominent than that in News. This may be because both SQuAD and NQ contain paragraphs in Wikipedia. Utilizing unlabeled documents in domains such as news articles may improve the generalization to other domains, such as News.
In our experiment, our model and InfoHCVAE improved the scores despite generating QA pairs from only the paragraphs in SQuAD Du train , unlike SemQG and HarQG, which generated QA pairs from paragraphs out of SQuAD in Wikipedia. Using paragraphs in and out of SQuAD Du train as the source for QAG may be more effective.
For paraphrased questions (Gan and Ng, 2019), implications (Ribeiro et al., 2019), and NoiseQA (Ravichander et al., 2021), augmenting questions similar to the corresponding challenge sets, that is, generating paraphrases, implications, and questions including the noise, successfully improved the robustness to these perturbations. While these methods slightly degraded or merely maintained the in-distribution score, we showed that QAG methods are less likely to exhibit a trade-off between the in- and out-of-distribution accuracies. Notably, VQAG did not degrade the scores on any of the 12 challenge sets while improving the in-distribution score. In contrast, SemQG degraded the scores on Hard and MT, and InfoHCVAE degraded the score on Para. This property of VQAG may arise because it can significantly improve diversity by combining different configurations of the KL control. Moreover, the size of the synthetic dataset created by VQAG was the smallest among the QAG methods. If sufficient diversity is ensured, significantly increasing the quantity may not be necessary. In Add and AddO, we showed that the QAG methods consistently improved adversarial robustness, which has not been studied in the QAG literature.

Analysis
To assess the usefulness of each dataset D in VQAG, we conducted an ablation study. As shown in Table 6, each dataset D has a meaningful effect on the performance. This result implies that creating more synthetic datasets with different configurations may further improve the performance.
To understand the differences in each dataset in terms of diversity, we conducted a simple analysis on the question type. As shown in Table 7, VQAG with different configurations corresponds to different distributions of question types, while more than 50% of the questions in the other datasets contain "what". Among the QAG methods, this point is unique to VQAG.

Discussion and Conclusion
We presented a variational QAG model incorporating two independent latent random variables. We showed that explicit KL control enables our model to significantly improve the diversity of QA pairs. Our synthetic datasets were shown to be noisy in terms of the grammaticality and answerability of questions, but effective in improving QA performance on the in-distribution test set and the 12 challenge sets. While our synthetic datasets are noisy, they may unintentionally improve robustness to the noise that can occur in real applications. However, we should pay attention to the negative effects of using our noisy datasets. For example, the lack of answerability of our synthetic questions may lead to poor performance in handling unanswerable questions, as in SQuAD v2.0. Moreover, the QAG methods led to improvements on most of the 12 challenge sets while being agnostic to the target distributions during training. We need to pursue such target-unaware methods to improve the robustness of QA models, because it is quite difficult for developers to know in advance the types of questions a QA model cannot handle.
In summary, our experimental results showed that the diversity of a QA dataset plays a non-negligible role in the robustness of QA models, and that this diversity can be boosted with QAG. In future work, we will consider using unlabeled documents in other domains to further improve the robustness to corpora from those domains.

B Distribution Modeling Capacity
We originally developed a QA pair modeling task to evaluate and compare QA pair generation models. We compared models based on the probability they assign to the ground truth QA pairs, using the negative log likelihood (NLL) of QA pairs, −log p(q, a|c), as the metric. Since variational models cannot compute the NLL directly, we estimate it with importance sampling. We also estimate each term of the decomposed NLL, i.e., NLL_a = −log p(a|c) and NLL_q = −log p(q|a, c). The better a model performs in this task, the better it fits the test set. As a baseline, to assess the effect of incorporating latent random variables, we implemented a pipeline model similar to Subramanian et al. (2018) using a deterministic pointer network.
Result Table 8 shows the results of QA pair modeling. First, our models with C = 0 are superior to the pipeline model, which means that introducing latent random variables aids QA pair modeling capacity. However, the KL terms converge to zero with C = 0. When we set C > 0, the KL values are greater than 0, which implies that the latent variables carry non-trivial information about questions and answers. We also observe that the target KL value C can control the KL values, showing the potential to avoid the posterior collapse issue.

Table 8: QA pair modeling capacity measured on the test set. We used the same value C for the target KL values C_a and C_q for simplicity. NLL: negative log likelihood of QA pairs. NLL_a (NLL_q): NLL of answers (questions). D_KLz and D_KLy are the KL divergences in Ineq. 1. NLL for our models is estimated with importance sampling using 300 samples.
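The importance-sampling NLL estimate used above can be sketched as follows (taking per-sample log probabilities as inputs is an assumed interface; a log-sum-exp keeps the average numerically stable):

```python
import math

def importance_nll(log_p_x_given_z, log_p_z, log_q_z):
    """Estimate -log p(x) from K samples z_k ~ q(z) via importance
    sampling: p(x) ~= (1/K) sum_k p(x|z_k) p(z_k) / q(z_k).
    All three arguments are length-K lists of log values."""
    log_ws = [a + b - c for a, b, c in zip(log_p_x_given_z, log_p_z, log_q_z)]
    m = max(log_ws)
    log_mean = m + math.log(sum(math.exp(w - m) for w in log_ws) / len(log_ws))
    return -log_mean
```

With more samples the estimate tightens; the paper uses 300 samples per example.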

C Information Theoretic Interpretation of the KL control
When training our models, we maximized the variational lower bound in Ineq. 1 averaged over the training samples; in other words, the expectation of the bound with respect to the data distribution is maximized. In the ideal case, the approximated posterior q_φ(z|a, c) is equal to the true posterior p_θ(z|a, c). Then, the expectation of the KL term with respect to the data distribution is equivalent to the conditional mutual information I(z; a|c). Mathematically, when the approximated posterior q_φ equals the true posterior p_θ, the expectation of the KL term in Ineq. 1 with respect to the data distribution is E_{(a,c)}[ D_KL( p_θ(z|a, c) ‖ p(z|c) ) ] = I(z; a|c). Thus, controlling the KL term is equivalent to controlling the conditional mutual information. The same holds for the question q and its latent variable y.
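The identity between the averaged KL term and the conditional mutual information follows from a standard manipulation, assuming the approximated posterior equals the true posterior p_θ(z|a, c):

```latex
\begin{align*}
\mathbb{E}_{p(a,c)}\!\left[ D_{\mathrm{KL}}\big(p_\theta(z \mid a, c)\,\|\,p(z \mid c)\big) \right]
&= \iint p(a, c) \int p_\theta(z \mid a, c)\,
   \log \frac{p_\theta(z \mid a, c)}{p(z \mid c)}\, dz\, da\, dc \\
&= \iiint p(a, c, z)\,
   \log \frac{p(z, a \mid c)}{p(z \mid c)\, p(a \mid c)}\, dz\, da\, dc \\
&= I(z;\, a \mid c),
\end{align*}
```

where the second equality uses p(a, c) p(z|a, c) = p(a, c, z) and Bayes' rule, and the last line is the definition of conditional mutual information.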

D Model Architecture
Prior and Posterior Distributions Following Zhao et al. (2017b), we hypothesized that the prior and posterior distributions of the latent variables follow multivariate Gaussian distributions with diagonal covariance:

z | c ∼ N(μ_prior, diag(σ²_prior)),    z | a, c ∼ N(μ_post, diag(σ²_post)),  (4)

and analogously for y. Latent variable z (and y) is then obtained using the reparameterization trick (Kingma and Welling, 2013): z = μ + σ ⊙ ε, where ⊙ denotes the Hadamard product and ε ∼ N(0, I). Finally, z and y are passed to the AE and QG models, respectively.
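The reparameterization step can be sketched as follows; a minimal NumPy version, parameterizing the diagonal Gaussian by its mean and log-variance (a common convention, not stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (Kingma & Welling, 2013).

    Drawing eps separately from the parameters keeps the sampling step
    differentiable with respect to mu and log_var, which is what allows
    backpropagation through the latent variable.
    """
    sigma = np.exp(0.5 * log_var)         # diagonal standard deviation
    eps = rng.standard_normal(mu.shape)   # noise, independent of the parameters
    return mu + sigma * eps               # Hadamard (elementwise) product
```

In an autodiff framework the same three lines are used verbatim; the gradient flows through `mu` and `sigma` while `eps` is treated as a constant.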

Answer Extraction
Model We regard answer extraction as two-step sequential decoding, i.e., p(a|c) = p(c_end | c_start, c) p(c_start | c), which predicts the start and end positions of an answer span in this order. For AE, we modify a pointer network (Vinyals et al., 2015) to take the initial hidden state h^AE_0 = W_1 z + b_1, which diversifies AE by learning the mapping from z to a. The decoding process is as follows: p(c_{t_i} | c_{t_{i−1}}, c) = softmax(u_i), where 1 ≤ i ≤ 2 and 1 ≤ j ≤ L_C, h^AE_i is the hidden state vector of the LSTM, h^IN_i is the i-th input, t_i denotes the start (i = 1) or end (i = 2) position in c, and v, W_n, and b_n are learnable parameters. We learn the embedding of the special token "⇒" as the initial input h^IN_1. When we used the embedding vector e_{t_i} as h^IN_{i+1}, instead of H^C_{t_i}, following Subramanian et al. (2018), we observed that the extracted spans tended to be long and unreasonable. We assume that this is because the decoder cannot obtain positional information from its input at each step.
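The two-step pointer decoding can be sketched as follows. This is an illustrative NumPy skeleton, not the paper's implementation: the weight names in `params` (`W1`, `b1`, `W_c`, `W_h`, `v`, `start_embedding`) and the additive-attention form of the scores are assumptions, and a simple callable stands in for the LSTM cell.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def extract_answer_span(H_c, z, params):
    """Two-step pointer decoding: p(a|c) = p(c_start|c) * p(c_end|c_start, c).

    H_c: (L, d) encoder states of the context; z: latent answer variable.
    Step 0 points at the start position, step 1 at the end position.
    """
    h = params["W1"] @ z + params["b1"]    # h_0 initialized from the latent z
    c = np.zeros_like(h)                   # LSTM cell state
    h_in = params["start_embedding"]       # learned embedding of the "=>" token
    positions = []
    for step in range(2):
        h, c = params["lstm_cell"](h_in, h, c)
        # pointer scores over context positions (additive-attention style)
        u = np.array([params["v"] @ np.tanh(params["W_c"] @ H_c[j]
                                            + params["W_h"] @ h)
                      for j in range(len(H_c))])
        t = int(np.argmax(softmax(u)))     # greedy position choice
        positions.append(t)
        # feed the chosen encoder state, not its word embedding, so the
        # decoder keeps positional information (the paper's observation)
        h_in = H_c[t]
    start, end = positions
    return start, end
```

A full implementation would additionally constrain `end >= start` and sample rather than take the argmax when generating diverse spans.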
Question Generation Model For QG, we modify an LSTM decoder with attention and copying mechanisms to take the initial hidden state h^QG_0 = W_4 y + b_3 as input, which diversifies QG. In detail, at each time step, the probability distribution P_vocab of generating words from the vocabulary is computed using attention (Bahdanau et al., 2014). Accordingly, the probability of outputting q_i is: p(q_i | q_{1:i−1}, a, c) = p_g P_vocab(q_i) + (1 − p_g) Σ_{j: c_j = q_i} a^copy_{ij}, (25–26) where p_g is the generation probability computed with the sigmoid function σ.
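The generate/copy mixture in Eqs. 25–26 can be sketched as follows; a minimal NumPy version, where `context_ids` maps each context position to its vocabulary id (the function name and argument layout are illustrative):

```python
import numpy as np

def output_distribution(p_vocab, attn_copy, context_ids, p_gen):
    """Mix generation and copying:
       p(q_i) = p_g * P_vocab(q_i) + (1 - p_g) * sum_{j: c_j = q_i} a_copy[i, j]

    p_vocab:     (V,) softmax distribution over the vocabulary
    attn_copy:   (L,) copy attention over the L context positions
    context_ids: length-L list of vocabulary ids of the context words
    p_gen:       scalar generation probability p_g in [0, 1]
    """
    p = p_gen * np.asarray(p_vocab, dtype=float)
    # scatter-add copy attention mass onto the ids of the context words;
    # repeated context words accumulate, matching the sum over j: c_j = q_i
    for j, word_id in enumerate(context_ids):
        p[word_id] += (1.0 - p_gen) * attn_copy[j]
    return p
```

Since both `p_vocab` and `attn_copy` are normalized, the mixture is itself a valid probability distribution for any `p_gen` in [0, 1].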

E Training Details
We use pretrained GloVe (Pennington et al., 2014) vectors with 300 dimensions and freeze them during training. The pretrained word embeddings are shared by the input layer of the context encoder and the input and output layers of the question decoder. The vocabulary consists of the 45k most frequent words in our training set. The dimension of the character-level embedding vectors is 32, and the number of windows is 100. The dimension of hidden vectors is 300, and the dimension of the latent variables is 200. All LSTMs used in this paper have one layer. We used Adam (Kingma and Ba, 2014) for optimization with an initial learning rate of 0.001. All parameters were initialized with Xavier initialization (Glorot and Bengio, 2010). Models were trained for 16 epochs with a batch size of 32. We used a dropout (Srivastava et al., 2014) rate of 0.2 for all LSTM layers and attention modules.
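For reference, the hyperparameters above can be collected into a single configuration; the dictionary layout and key names are illustrative, the values are those stated in this appendix:

```python
# Hyperparameters from Appendix E; key names are illustrative, values are from the text.
HPARAMS = {
    "embedding_dim": 300,       # pretrained GloVe vectors, frozen
    "char_embedding_dim": 32,
    "num_windows": 100,
    "hidden_dim": 300,
    "latent_dim": 200,
    "vocab_size": 45_000,       # most frequent words in the training set
    "lstm_layers": 1,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "epochs": 16,
    "batch_size": 32,
    "dropout": 0.2,             # applied to all LSTM layers and attention modules
}
```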

F Answer Extraction and Question Generation
Tables 9 and 10 show the detailed results of AE and QG, in which various values of C_a and C_q are explored.

G Latent Interpolation

Table 11 shows the latent interpolation between two ground-truth QA pairs using VQAG with (C_a, C_q) = (5, 20). This result shows that z controls the answer and y controls the question.