Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval

In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between the target-domain and synthetic-data distributions, and reduces model overfitting to the source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to the machine learning and biomedical domains. We find that back-training vastly outperforms self-training, with a mean improvement of 7.8 BLEU-4 points on generation and 17.6% top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset, MLQuestions, containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.


Introduction
In domains such as education and medicine, collecting labeled data for tasks like question answering and generation requires domain experts, making it expensive to build supervised models. Transfer learning can circumvent this limitation by exploiting models trained on other domains where labeled data is readily available (Bengio, 2012; Ruder et al., 2019). However, using these pre-trained models directly without adapting them to the target domain often leads to poor generalization due to distributional shift (Zhao et al., 2019). To address this issue, such models are further trained on cheap, synthetically generated labeled data obtained by exploiting unlabeled data from the target domain (Ramponi and Plank, 2020). One such popular data augmentation method for unsupervised domain adaptation (UDA) is self-training (Yarowsky, 1995).

Table 1: Self-Training and Back-Training for unsupervised domain adaptation of question generation and passage retrieval.

Task                      | Algorithm     | Input             | Output
Question Generation (QG)  | Self-Training | p_u ~ P_T(p)      | q̂ ~ P_S(q|p_u)
Question Generation (QG)  | Back-Training | p̂ ~ P_S(p|q_u)    | q_u ~ P_T(q)
Passage Retrieval (IR)    | Self-Training | q_u ~ P_T(q)      | p̂ ~ P_S(p|q_u)
Passage Retrieval (IR)    | Back-Training | q̂ ~ P_S(q|p_u)    | p_u ~ P_T(p)

In self-training, inputs are sampled from the target domain data distribution P_T and their corresponding outputs are generated using a supervised model P_S trained on the source domain. In back-training, the inverse happens: outputs are sampled from P_T and their corresponding inputs are generated using P_S. Notation: q and p denote questions and passages respectively; a subscript u (e.g., q_u) denotes a sample from the target domain, and a hat (e.g., q̂) denotes a sample generated by a supervised model trained on the source domain.
In self-training, given a pre-trained model that can perform the task of interest in a source domain and unlabeled data from the target domain, the pretrained model is used to predict noisy labels for the target domain data. The pre-trained model is then fine-tuned on synthetic data to adapt to the new domain. To improve the quality of the synthetic data, it is also common to filter out low-confidence model predictions (Zhu, 2005).
A model fine-tuned on its own confident predictions might suffer from confirmation bias, which leads to overfitting (Yu et al., 2020). This means that the distributional gap between the target domain's true output distribution and the learned output distribution could grow wider as training proceeds. In this paper, we propose a new training protocol called back-training which closes this gap (the name is inspired by back-translation for machine translation). While self-training generates synthetic data where noisy outputs are aligned with quality inputs, back-training generates quality outputs aligned with noisy inputs. A model fine-tuned to predict real target-domain outputs from noisy inputs overfits less to the source domain (Vincent et al., 2008) and matches the target domain distribution more closely.
We focus on unsupervised domain adaptation (UDA) of Question Generation (QG) and Passage Retrieval (IR) from generic domains such as Wikipedia to target domains. Our target domain of interest is machine learning, as it is a rapidly evolving area of research. QG and IR could empower student learning on MOOCs (Heilman and Smith, 2010). For example, from a passage about linear and logistic regression, an education bot could generate questions such as what is the difference between linear and logistic regression? to teach a student about these concepts. Moreover, IR models could help students find relevant passages for a given question (Fernández-Luna et al., 2009). In this domain, unsupervised data such as text passages and questions are easy to obtain separately, but not aligned to each other.
We also perform our main domain adaptation experiments on the biomedical domain using the PubMedQA dataset (Jin et al., 2019) to further strengthen our hypothesis. Table 1 summarizes the differences between self-training and back-training for QG and IR. Consider the QG task: for self-training, we first train a supervised model P_S(q|p) on the source domain that generates a question q given a passage p. We use this model to generate a question q̂ for an unsupervised passage p_u sampled from the target domain distribution P_T(p). Note that q̂ is generated conditioned on the target domain passage using P_S(q|p_u). We use the pairs (p_u, q̂) as synthetic training data to adapt P_S(q|p) to the target domain. In back-training, we assume access to unsupervised questions and passages from the target domain. We first train an IR model P_S(p|q) on the source domain, then sample a question q_u from the target domain distribution P_T(q). We condition the retriever on this question, i.e., P_S(p|q_u), retrieve a passage p̂ from the target domain, and treat it as a noisy alignment. We use the pairs (p̂, q_u) as synthetic training data to adapt P_S(q|p). Table 1 also describes the details of domain adaptation for the passage retriever.
Our contributions and findings are as follows: 1) We show that QG and IR models trained on NaturalQuestions (Kwiatkowski et al., 2019) generalize poorly to target domains, with at least 17% mean performance decline on both QG and IR tasks. 2) Although self-training improves the domain performance marginally, our back-training method outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6% top-20 retrieval accuracy across both target domains. 3) We further propose consistency filters to remove low-quality synthetic data before training. 4) We release MLQuestions: a domain adaptation dataset for the machine learning domain containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.

Background
In this section, we describe the source and target domain datasets, models for question generation and passage retrieval, and the evaluation metrics.

Source Domain: NaturalQuestions
We use the NaturalQuestions dataset (Kwiatkowski et al., 2019) as our source domain. NaturalQuestions is an open-domain question answering dataset containing questions from Google search engine queries paired with answers from Wikipedia. We use the long form of the answer, which corresponds to passages (paragraphs) of Wikipedia articles. It is the largest dataset available for open-domain QA, comprising 300K training examples, each pairing a question with a Wikipedia passage. We sample 200 random questions from NaturalQuestions and annotate them into 5 classes based on the nature of the question, following Nielsen et al. (2008). Table 2 shows these classes and their distribution. As seen, 86% of them are descriptive questions starting with what, who, when, and where. Refer to Appendix A.2 for details on dataset pre-processing and Appendix A.4 for a detailed taxonomy description.

Target Domain I: Machine Learning
Our first target domain of interest is machine learning. There is no large supervised QA dataset for this domain, and it is expensive to create one since it requires domain experts. However, it is relatively cheap to collect a large number of ML articles and questions. We collect ML concepts and passages from the Wikipedia machine learning page¹ and recursively traverse its subcategories. We end up with 1.7K concepts such as Autoencoder, word2vec, etc., and 50K passages related to these concepts. For question mining, we piggy-back on Google Suggest's People also ask feature to collect 104K questions, using the above machine learning concept terms as seed queries combined with question terms such as what, why, and how. However, many questions could belong to the generic domain due to ambiguous terms such as eager learning. We employ three domain experts to annotate 1000 questions as in-domain or out-of-domain. Using this data, we train a classifier (Liu et al., 2019) and filter out questions with in-domain probability less than 0.8. This results in 46K in-domain questions; upon analysing 100 of them, 92% are indeed in-domain. Of these, we use 35K questions as unsupervised data. See Appendix A.3 for classifier training details and performance validation.
The remaining 11K questions are used to create supervised data for model evaluation. We use the Google search engine to find answer passages to these questions, resulting in around 11K passages. Among these, we select 3K question-passage pairs as the evaluation set for QG (50% validation and 50% test). For IR, we use the full 11K passages as candidate passages for the 3K questions. We call our dataset MLQuestions. Table 2 compares MLQuestions with NaturalQuestions. We note that MLQuestions has a higher diversity of question classes than NaturalQuestions, making the transfer setting challenging.

Target Domain II: Biomedical Science
Our second domain of interest is biomedicine, for which we use the PubMedQA dataset (Jin et al., 2019).
Questions are extracted from PubMed abstract titles ending with a question mark, and passages are the conclusive part of the abstract. As unsupervised data, we utilize the PQA-U(nlabeled) subset containing 61.2K unaligned questions and passages. For supervised data, we use the PQA-L(abeled) subset of 1K question-passage pairs manually curated by domain experts. We use the same 50-50% dev-test split as Jin et al. (2019) as the evaluation set for QG. For IR, in order to have the same number of candidate passages as MLQuestions, we combine 10K randomly sampled passages from PQA-U with the 1K PQA-L passages to obtain 11K candidate passages for the 1K questions.

Question Generation Model
We use BART (Lewis et al., 2020) to train a supervised QG model on NaturalQuestions. BART is a Transformer encoder-decoder model pretrained to reconstruct original text from noisy text inputs. For QG, BART is further fine-tuned to learn a conditional language model P_S(q|p) that generates a question q given a passage p from the source domain. For experimental details, see Appendix A.1.
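To make the setup concrete, the following is a minimal sketch (not the authors' released code) of computing the QG fine-tuning loss for P_S(q|p) with BART in Hugging Face Transformers; the checkpoint name, helper function, and sequence lengths mirror the description here and in Appendix A.1, and an optimizer step over such losses would complete training.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def qg_training_loss(passage: str, question: str) -> torch.Tensor:
    # The passage is the encoder input; the question is the decoding target.
    inputs = tokenizer(passage, truncation=True, max_length=512, return_tensors="pt")
    labels = tokenizer(question, truncation=True, max_length=150, return_tensors="pt").input_ids
    # With labels supplied, BART returns the token-level cross-entropy loss for P_S(q|p).
    return model(**inputs, labels=labels).loss
```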

Passage Retrieval Model
We use the Dense Passage Retriever (DPR; Karpukhin et al., 2020) pretrained on NaturalQuestions. DPR encodes a question q and a passage p separately using a BERT bi-encoder and is trained to maximize the dot product (similarity) between the encodings E_P(p) and E_Q(q), while minimizing similarity with closely related but negative passages. Essentially, DPR is a conditional classifier P_S(p|q) that retrieves a relevant passage p given a question q from the source domain. For model training details, see Appendix A.1.
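As an illustration of this scoring, here is a hedged sketch using the public NQ-trained DPR encoders from Hugging Face Transformers; the helper function is ours, not the paper's code.

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
p_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
p_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

def dpr_similarity(question: str, passage: str) -> float:
    # The score is the dot product E_Q(q) . E_P(p) that DPR maximizes for aligned pairs.
    e_q = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
    e_p = p_enc(**p_tok(passage, truncation=True, max_length=512, return_tensors="pt")).pooler_output
    return torch.matmul(e_q, e_p.T).item()
```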
Evaluation Metrics

For QG, we report BLEU-4 and ROUGE scores. For IR, we follow Karpukhin et al. (2020) and measure top-k retrieval accuracy: the fraction of cases where the correct passage lies in the top k retrieved passages. We consider 11K candidate passages in all datasets for retrieval at test time.
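A small helper makes the retrieval metric explicit; it assumes a precomputed question-passage score matrix and gold passage indices (both illustrative).

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, gold: np.ndarray, k: int = 20) -> float:
    # scores: (num_questions, num_passages) similarity matrix; gold: gold passage index per question.
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the k highest-scoring passages
    hits = (top_k == gold[:, None]).any(axis=1)     # is the gold passage among them?
    return float(hits.mean())
```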

Transfer from Source to Target Domain without Adaptation
We investigate how well models trained on NaturalQuestions transfer directly to our target domains without any domain adaptation. For comparison, we also present the results on NaturalQuestions.
For a fair comparison, we sample the same number of examples from the development set of NaturalQuestions as in the test sets of MLQuestions and PubMedQA for the QG and IR tasks. Figure 1 shows the results. We observe large performance drops across all generation metrics (14-20%) from NaturalQuestions (IID data) to MLQuestions and PubMedQA (OOD data). Human evaluation of QG (see Table 7) also reveals that the generated questions are either generic or fail to capture domain-specific terminology. OOD performance on the IR task is even worse (25-40% drop), revealing a large distribution shift between the source and target domains.

Unsupervised Domain Adaptation
In this section, we describe the self-training and back-training methods used to generate synthetic training data for unsupervised domain adaptation (UDA). We also introduce consistency filters to further improve the quality of the synthetic data.

Problem Setup
The source domain consists of labeled data D_S containing questions paired with passages, while the target domain provides unlabeled passages P_U and unlabeled questions Q_U. Note that P_U and Q_U are not necessarily aligned with each other. Given this setup, our goal is to learn QG and IR models with parameters θ ≡ {θ_G, θ_R} that achieve high generation and retrieval performance on the target domain T. Table 3 describes the notation used throughout the paper.

Self-Training for UDA
Self-training (Yarowsky, 1995) involves training a model on its own predictions. We present self-training for UDA in Algorithm 1. First, the baseline models θ_G and θ_R are trained on the source passage-question corpus D_S. Then, at each iteration, these models generate pseudo-labeled data from the unlabeled passages P_U for question generation and the unlabeled questions Q_U for passage retrieval. For QG, θ_G generates a question q̂ for each p_u ∈ P_U and adds (p_u, q̂) to the synthetic data S_G. For IR, θ_R retrieves a passage p̂ from P_U for each q_u ∈ Q_U and adds (q_u, p̂) to S_R. The models θ_G and θ_R are then fine-tuned on S_G and S_R respectively. The process is repeated for a desired number of iterations, which we refer to as iterative refinement. Note that in self-training, inputs are sampled from the target domain and the outputs are predicted (noisy).
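The data-construction step of self-training can be summarized in a few lines; this is a sketch, with generate_question and retrieve_passage standing in for the source-trained models θ_G and θ_R (placeholders, not the paper's functions).

```python
def self_training_data(P_U, Q_U, generate_question, retrieve_passage):
    # Inputs are real target-domain samples; outputs are noisy model predictions.
    S_G = [(p_u, generate_question(p_u)) for p_u in P_U]      # (real passage, predicted question) for QG
    S_R = [(q_u, retrieve_passage(q_u, P_U)) for q_u in Q_U]  # (real question, retrieved passage) for IR
    return S_G, S_R
```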

Back-Training for UDA
The main idea of back-training is to work backwards: start with true output samples from the target domain, and predict the corresponding inputs that align most closely with each output. While self-training assumes inputs are sampled from the target domain distribution, back-training assumes outputs are sampled from the target domain distribution. When two tasks are of a dual nature (i.e., the input of one task is the output of the other), back-training can generate synthetic training data for one task using the other, on the condition that outputs can be sampled from the target domain distribution. QG and IR meet both criteria. For QG, we have unlabeled questions in the target domain, and its dual task IR can retrieve their corresponding input passages from the target domain. For IR, we have unlabeled passages in the target domain, and QG can generate their input questions. Formally, for QG, the IR model θ_R retrieves a passage p̂ from P_U for each q_u ∈ Q_U and adds (p̂, q_u) to S_G. For IR, the QG model θ_G generates a question q̂ for each p_u ∈ P_U and adds (q̂, p_u) to S_R.

Algorithm 1: Vanilla Self-Training / Back-Training for unsupervised domain adaptation. The vanilla algorithms can be improved further using consistency filters.

  Train θ_G and θ_R on the source corpus D_S
  repeat
    S_G ← ∅, S_R ← ∅
    for q_u ∈ Q_U do
      p̂ ← retrieve the passage from P_U closest to q_u using θ_R
      add (p̂, q_u) to S_R (self-training) or to S_G (back-training)
    end for
    for p_u ∈ P_U do
      q̂ ← generate a question from p_u using θ_G
      add (p_u, q̂) to S_G (self-training) or to S_R (back-training)
    end for
    θ_G ← finetune on S_G;  θ_R ← finetune on S_R
  until dev performance decreases
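The back-training counterpart, under the same placeholder assumptions as the self-training sketch above, simply swaps which side of each pair is real: the target-domain sample is now the output, and the dual task's model produces the noisy input.

```python
def back_training_data(P_U, Q_U, generate_question, retrieve_passage):
    # Outputs are real target-domain samples; inputs are noisy predictions from the dual task's model.
    S_G = [(retrieve_passage(q_u, P_U), q_u) for q_u in Q_U]  # (retrieved passage, real question) for QG
    S_R = [(generate_question(p_u), p_u) for p_u in P_U]      # (predicted question, real passage) for IR
    return S_G, S_R
```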

Similarities with back-translation
Back-translation is an effective method to improve machine translation using synthetic parallel corpora containing human-produced target-language sentences paired with artificial source-language translations (Sennrich et al., 2016; Edunov et al., 2018). Back-training is inspired by this idea, but is not limited to machine translation.

Consistency filters for Self-Training and Back-Training

Figure 2: Self-training and Back-training for UDA.

The above algorithms utilize the full unlabeled data along with their predictions, even when the predictions have low confidence. To alleviate this problem in self-training, it is common to filter low-confidence predictions (Zhu, 2005). We generalize this notion as consistency filtering: for QG and IR, a generator G ∈ {θ_G, θ_R} produces synthetic training data for a task, whereas a critic C ∈ {θ_G, θ_R} filters low-confidence predictions. We define two types of consistency filtering: 1) Self consistency, where the generator and the critic are the same model. This is equivalent to filtering out the model's own low-confidence predictions in self-training. 2) Cross consistency, where the generator and the critic are different models. This means θ_R filters the synthetic data generated by θ_G, and vice versa. With θ_G as critic we use the conditional log-likelihood log Pr(q|p; θ_G) as the confidence score. With θ_R as critic we use the dot-product similarity between the encodings E_P(p) and E_Q(q) as the confidence score. Self-training and back-training can be combined with one or both of these consistency checks. We set filter thresholds to accept 75% of the synthetic data (refer to Appendix A.1 for exact threshold values).
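A sketch of the filtering step, assuming a per-pair confidence scorer for whichever critic is chosen; the exact thresholds used in the paper are listed in Appendix A.1, whereas here the cutoff is simply taken from the score distribution so that roughly 75% of pairs are kept.

```python
import numpy as np

def consistency_filter(pairs, score_fn, keep_fraction=0.75):
    # score_fn(p, q) is log Pr(q|p; theta_G) when theta_G is the critic,
    # or the DPR dot product E_P(p) . E_Q(q) when theta_R is the critic.
    scores = np.array([score_fn(p, q) for p, q in pairs])
    cutoff = np.quantile(scores, 1.0 - keep_fraction)  # keep the top keep_fraction of pairs
    return [pair for pair, score in zip(pairs, scores) if score >= cutoff]
```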
A popular data filtering technique in data augmentation is cycle consistency (Alberti et al., 2019), which is enforced by generating a noisy input back from the noisy output and checking its similarity against the source input. We leave its exploration as future work.

Domain Adaptation Evaluation
As described in Section 2, our source domain is NaturalQuestions and the target domains are MLQuestions and PubMedQA. We evaluate whether domain adaptation improves performance compared to no adaptation. We empirically investigate qualitative differences between self-training and back-training to validate their effectiveness. We also investigate whether consistency filters and iterative refinement result in further improvements.

Figure 3: Evolution of QG model perplexity (PPL) and IR model loss for self-training vs. back-training as training proceeds on MLQuestions. Trajectories run from right to left as training loss decreases with time. The rightmost points are plotted after the first mini-batch of training, and subsequent points after each mini-batch.

No-adaptation vs. self-training vs. back-training
In Table 4, we compare the performance of vanilla self-training and back-training (i.e., without consistency filtering or iterative refinement) with the no-adaptation baseline (i.e., a model trained on the source domain and directly tested on the target domain). On MLQuestions, self-training achieves an absolute gain of around 0.6 BLEU-4 points for QG and 7.13 R@20 points for IR. Back-training, in contrast, vastly outperforms self-training, with improvements of 9.4 BLEU-4 points on QG and 19.6 R@20 points on IR over the no-adaptation baseline.
The improvements are even bigger on PubMedQA, where self-training shows no improvement at all.

Why does back-training work?
In Figure 3, we track QG model perplexity and IR model loss on MLQuestions for self-training and back-training as training proceeds. We observe that: (1) training and test losses (and hence likelihoods) are correlated, and hence the data generated by back-training matches the target distribution more closely than self-training; (2) self-training achieves lower training error but higher test error compared to back-training, indicating overfitting; (3) extrapolating the back-training curve suggests that scaling to additional unlabeled data will likely improve the model further. Figure 4 plots the confidence-score distributions for self-training (likelihood scores of the model's own predictions) and back-training (likelihood scores of a different model's predictions) for the QG and IR tasks on MLQuestions. The figures reveal that the self-training curve has a high mean and low variance, indicating less diverse training data. In contrast, the back-training curve has a lower mean and higher variance, indicating more diverse training data.

Are consistency filters useful?
Table 5 reveals that although our consistency filters outperform the base models on MLQuestions, the improvements are not very significant. Our hypothesis is that the quality of the synthetic data is already high (as supported by the findings in Section 5.2), which limits the performance gain. However, the filters reduce the synthetic training data by 25%, which leads to faster model training without any drop in performance. Additionally, self-consistency improves self-training in many problems (Zhu, 2005; Sachan and Xing, 2018). We believe our cross-consistency filter could also be explored on similar problems.

Is iterative refinement useful?
Iterative refinement via the procedure described in Algorithm 1 yields further improvements of up to 1.53 BLEU-4 points and 2.07 R@20 points for back-training (Table 6). Self-training, on the other hand, shows no improvement for QG and only marginal improvements for IR.

Human Evaluation Results
We also report a human evaluation of QG: we sample 50 generated questions from the MLQuestions test set and ask three domain experts to rate each question as good or bad on four attributes: Naturalness, i.e., fluency and grammatical correctness; Coverage, i.e., whether the question covers the whole passage or only part of it; Factual Correctness in the ML domain; and Answerability, i.e., whether the question can be answered using the passage. From the results in Table 7, we observe that the back-training model is superior on all four criteria. However, all models perform similarly on naturalness.
In Table 8 we present some questions generated by the various models on the MLQuestions and PubMedQA datasets. Subjectively, we find that the no-adaptation and self-training models fail to capture domain knowledge, generate generic questions, and miss important words present in the gold question, whereas the back-training model's generated question matches the gold question more closely. We also analyze how well our QG model generates different kinds of questions according to the taxonomy described in Table 2. In Figure 5 we plot the confusion matrix between the actual question class and the generated question class for our back-training model. To do this, 100 actual questions and their generated counterparts are annotated by question class.

Table 8: Example questions generated by the different models (ST: self-training, BT: back-training), with the reference (gold) question.

Passage: If the line is a good fit for the data then the residual plot will be random. However, if the line is a bad fit for the data then the plot of residuals will be random.
No-adaptation: What is the meaning of random plot in statistics?
ST: What is the meaning of random plot in statistics?
BT: How do you know if a residual plot is random?
Reference: How do you know if a residual plot is good?

Passage: Financial incentives for smoking cessation in pregnancy are highly cost-effective, with an incremental cost per quality adjusted life years of £482, which is well below recommended decision thresholds.
No-adaptation: When do we stop smoking in pregnancy?
ST: When do you stop smoking in pregnancy?
BT: Is there a financial incentive for smoking cessation in pregrancy?
Reference: Are financial incentives cost-effective to support smoking cessation during pregnancy?

Related Work
Data augmentation methods like self-training have been applied to numerous NLP problems such as question answering (Chung et al., 2018), machine translation (Ueffing, 2006), and sentiment analysis (He and Zhou, 2011). Sachan and Xing (2018) apply self-training to generate synthetic data for question generation and question answering (QA) in the same domain, and filter the data using the QA model's confidence on the answer generated for each question. Back-translation's idea of aligning real outputs with noisy inputs is shared with back-training and has been successful in improving unsupervised NMT (Artetxe et al., 2018; Edunov et al., 2018). Zhang et al. (2018) use back-translation to generate synthetic data for the task of automatic style transfer. Back-training also shares similarities with co-training (Blum and Mitchell, 1998; Wan, 2009) and tri-training (Li et al., 2014; Weiss et al., 2015), where multiple models for the same task generate synthetic data for each other.

Conclusion and Future Work
We introduce back-training, an unsupervised domain adaptation method, focusing on Question Generation and Passage Retrieval. Our algorithm generates synthetic data pairing high-quality outputs with noisy inputs, in contrast to self-training, which produces noisy outputs aligned with quality inputs. We find that back-training outperforms self-training by a large margin on our newly released MLQuestions dataset and on PubMedQA.
One area of future research will be exploring back-training for other paired tasks like visual question generation (Mostafazadeh et al., 2016) and image retrieval (Datta et al., 2008), and style transfer (Gatys et al., 2015) from source to target domain and vice-versa. The theoretical foundations for the superior performance of back-training have to be explored further.

A.1 Model Training Details
All experiments are run with the same training configuration. Mean scores across 5 individual runs are reported on the test set. We describe the full model training details below for reproducibility.

BART Question Generation Transformer
We train BART-Base² with a batch size of 32 and a learning rate of 1e-5. For all experiments we train the model for 5 epochs, though the model converges in 2-3 epochs. For optimization we use Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, ε = 1e-8. Questions and passages are padded to 150 and 512 tokens respectively. For decoding we use top-k sampling (Fan et al., 2018) with k = 50. The model is trained with the standard cross-entropy objective.
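For reference, decoding with these settings looks roughly as follows in Hugging Face Transformers (a sketch: the fine-tuned QG weights would be loaded in place of the base checkpoint, and the input passage is a placeholder).

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # fine-tuned QG weights in practice

inputs = tokenizer("passage text goes here", truncation=True, max_length=512, return_tensors="pt")
question_ids = model.generate(
    **inputs,
    do_sample=True,   # top-k sampling (Fan et al., 2018)
    top_k=50,
    max_length=150,   # questions are capped at 150 tokens
)
print(tokenizer.decode(question_ids[0], skip_special_tokens=True))
```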

Dense Passage Retriever (DPR)
We use the publicly available implementation of the DPR model³ to train our IR system. We also use the pre-trained NQ DPR checkpoint provided by the authors. We construct negative passages similarly to Karpukhin et al. (2020), as the top-k passages returned by BM25 that match the most question tokens but do not contain the answer. We set k = 7 for our experiments. For iterative refinement models, we always use the same negative passages as the model obtained after the first iteration (T = 1). This is because after each iteration the model is fine-tuned starting from the previous model rather than re-trained on the pseudo-data. We obtain better performance gains on the dev set with this setting.
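A sketch of this negative mining, assuming the rank_bm25 package as the BM25 implementation (the paper does not specify its BM25 tooling, so the library choice and helper are illustrative).

```python
from rank_bm25 import BM25Okapi

def mine_negative_passages(question: str, answer: str, passages: list, k: int = 7):
    # Rank all candidate passages by BM25 overlap with the question tokens.
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    ranked = bm25.get_top_n(question.lower().split(), passages, n=len(passages))
    # Keep the k best-matching passages that do NOT contain the answer (hard negatives).
    return [p for p in ranked if answer.lower() not in p.lower()][:k]
```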

Table 9: Threshold values for different consistency filters. Values are chosen as the third quartile (Q3) of the score distribution of the synthetic data, accepting 75% of the synthetic data for model training.

Consistency filter  | Critic θ_G | Critic θ_R
Self consistency    | -1.19      | 78.24
Cross consistency   | -5.95      | 71.65

Table 9 lists the threshold values for the different consistency filters. The values are obtained by plotting the confidence-score distribution of the synthetic data and setting the threshold to accept 75% of the data (i.e., the third quartile Q3). As explained in Section 4.4, with θ_G as critic we use the conditional log-likelihood log Pr(q|p; θ_G) as the confidence score. With θ_R as critic we use the DPR similarity score E_P(p)·E_Q(q) as the confidence score.

A.2 NaturalQuestions Dataset Pre-processing
We use the Google NaturalQuestions dataset as our source domain corpus. We pre-process the publicly available train and dev corpora in a similar manner to Mishra et al. (2020).

A.3 In-Domain Question Classifier for MLQuestions

Many of the mined questions contain ambiguous terms (Menner, 1936): they have a different meaning in another context (e.g., "Ensemble", "Eager Learning", "Transformers"). This means the collected data contains OOD questions. Upon analyzing 100 random questions drawn from the 104K collected questions, we find that 27 of them are OOD.
To filter such undesirable data, we randomly sample 1000 questions and recruit 3 domain experts to label them as in-domain or OOD. 200 questions were labeled by all 3 annotators to determine inter-annotator agreement; we record a Cohen's Kappa agreement score (McHugh, 2012) of 0.84. The 1000 annotated questions are split into train, dev, and test sets of sizes 800, 50, and 150 respectively. Based on this labeled data, we train a classifier on top of question features to classify the remaining questions as useful or OOD. For extracting features from questions, we utilize a DistilBERT model (Sanh et al., 2019) trained on SNLI+MultiNLI (Bowman et al., 2015; Williams et al., 2018) and then fine-tuned on the STS benchmark (Cer et al., 2017) train set⁵. This gives us a feature vector of size 768, which is used to train an SVM classifier⁶ with an L2 penalty of 0.1. We set the acceptance threshold relatively high, at 0.8, to ensure high precision, thus accepting very few OOD questions. Figure 6 shows the confusion matrix on the test set with α set to 0.8. The classifier obtains high precision of 94.6% and average recall of 66%.
High precision is empirically verified by annotating 100 randomly sampled accepted questions, of which 92 are found to be in-domain. The remaining 8% of the data can be treated as noise for model training. Figure 7 plots the precision-recall trade-off as the acceptance threshold α varies.
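A sketch of this filtering pipeline; the sentence-encoder checkpoint, SVM configuration, and toy data below are assumptions chosen to match the description, not the exact setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

# Hypothetical toy data standing in for the 800 annotated training questions.
train_questions = [
    "what is backpropagation", "how does gradient descent work",
    "what is an autoencoder", "why use dropout in neural networks",
    "what is overfitting in machine learning",
    "what is the capital of france", "who wrote hamlet",
    "when did world war two end", "how tall is mount everest",
    "what is the speed of light",
]
train_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # 1 = in-domain (ML), 0 = out-of-domain
candidate_questions = ["how does word2vec work", "what do transformers eat"]

# DistilBERT sentence encoder fine-tuned on NLI and STS data (768-dim embeddings).
encoder = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")
clf = SVC(C=0.1, kernel="linear", probability=True)  # L2-regularized SVM

clf.fit(encoder.encode(train_questions), train_labels)

# Accept only questions whose in-domain probability clears the threshold alpha = 0.8.
probs = clf.predict_proba(encoder.encode(candidate_questions))[:, 1]
kept = [q for q, p in zip(candidate_questions, probs) if p >= 0.8]
print(kept)
```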

A.4 Taxonomy of MLQuestions Dataset
In Table 2, we show the distribution of various types of questions in the MLQuestions and NaturalQuestions datasets. We split the questions into 5 categories based on Nielsen's Educational Taxonomy (Nielsen et al., 2008): descriptive questions, which ask for definitions or examples; method questions, which ask for computations or procedures; explanation questions, which ask for justifications; comparison questions, which ask to compare two or more concepts; and preference questions, which are answered by a selection from a set of options. Refer to Nielsen et al. (2008) for the full taxonomy.

B.2 For all reported experimental results
• The best-performing configurations on the validation set were chosen for final model training.
• Hyperparameter configurations for best-performing models: We provide complete hyperparameter details for the QG and IR models in Appendix A.1.
• The method of choosing hyperparameter values (e.g., uniform sampling, manual tuning, etc.) and the criterion used to select among them (e.g., accuracy): We use manual tuning, with BLEU-4 for the QG task and R@40 retrieval accuracy for the IR task on the validation set as the selection criteria.
• Summary statistics of the results (e.g., mean, variance, error bars, etc.): Mean scores across 5 individual runs are provided for all experiments in the main paper.

B.3 For all datasets used
• Relevant details such as languages, and number of examples and label distributions: Section 2 provides statistics for the NaturalQuestions, MLQuestions, and PubMedQA datasets. All datasets are in English.
• Details of train/validation/test splits: These are also provided in Section 2 for all three datasets.
• Explanation of any data that were excluded, and all pre-processing steps: Relevant details are provided in Section 2 for all three datasets.
• A zip file containing data or link to a downloadable version of the data: We provide the MLQuestions dataset in the submission zip file. The NaturalQuestions and PubMedQA datasets can be downloaded from https://ai.google.com/research/NaturalQuestions/download and https://github.com/pubmedqa/pubmedqa respectively. The datasets can be pre-processed following the procedures mentioned in Section 2.
• For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control: We provide the above details for our newly created MLQuestions dataset in Section 2.2.