Single-dataset Experts for Multi-dataset Question Answering

Many datasets have been created for training reading comprehension models, and a natural question is whether we can combine them to build models that (1) perform better on all of the training datasets and (2) generalize and transfer better to new datasets. Prior work has addressed this goal by training one network simultaneously on multiple datasets, which works well on average but is prone to over- or under-fitting different sub-distributions and might transfer worse compared to source models with more overlap with the target dataset. Our approach is to model multi-dataset question answering with an ensemble of single-dataset experts, by training a collection of lightweight, dataset-specific adapter modules (Houlsby et al., 2019) that share an underlying Transformer model. We find that these Multi-Adapter Dataset Experts (MADE) outperform all our baselines in terms of in-distribution accuracy, and that simple methods based on parameter averaging lead to better zero-shot generalization and few-shot transfer performance, offering a strong and versatile starting point for building new reading comprehension systems.


Introduction
The goal of reading comprehension is to create computer programs that can answer questions based on a single passage of text. Many reading comprehension datasets have been introduced over the years, and prior work has explored ways of training one network on multiple datasets to get a model that generalizes better to new distributions (Talmor and Berant, 2019; Fisch et al., 2019; Khashabi et al., 2020). Our goal is to build a multi-dataset model that performs well on the training distributions and can also serve as a strong starting point for transfer learning to new datasets. Multi-dataset training provides a way to model the regularities between datasets, but it has the following shortcomings. First, multi-task models are liable to over- or under-fit different tasks (Gottumukkala et al., 2020), which can result in worse in-distribution accuracy. Second, given a particular target dataset, multi-dataset models might achieve worse transfer performance compared to a specialized model trained on a more similar source dataset.

[Figure 1: Dataset-specific adapter modules with a shared Transformer model θ. They are optimized jointly on a set of training datasets (left). To transfer to a new dataset (right), we either average the parameters of the adapters, or fine-tune all adapters on the target dataset and take the weighted average at the end (Section 3).]
Our idea is to combine the benefits of single- and multi-dataset training by training a collection of single-dataset experts that share an underlying Transformer model (Figure 1). This system is based on adapters (Houlsby et al., 2019), lightweight task-specific modules interleaved between the layers of a pre-trained Transformer (e.g., BERT; Devlin et al., 2019). The standard use of adapters is as a parameter-efficient alternative to fine-tuning: task-specific adapters are trained separately on top of a frozen Transformer, which means the adapters cannot directly learn cross-task regularities. We instead first train a shared Transformer in a multi-adapter setup before refining adapters for individual datasets, which we call Multi-Adapter Dataset Experts (MADE). Our intuition is that the shared parameters encode regularities across different reading comprehension tasks while the adapters model the sub-distributions, resulting in more accurate and robust specialized models that transfer better to a variety of target datasets.
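The adapter modules mentioned above are small bottleneck layers inserted between Transformer layers. As a rough illustration (the hidden and bottleneck sizes are illustrative defaults, not the paper's exact configuration), a Houlsby-style adapter can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter in the style of Houlsby et al. (2019): down-project,
    nonlinearity, up-project, plus a residual connection. Sizes are
    illustrative, not the exact configuration used in the paper."""

    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        # Zero-initialize the up-projection so the adapter starts as an
        # identity function and does not disturb the pre-trained network.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))
```

Because the bottleneck is narrow, each adapter adds only a small fraction of the Transformer's parameters, which is what makes training one expert per dataset cheap.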
We apply this approach to a range of extractive question answering datasets from the MRQA 2019 shared task (Fisch et al., 2019), training MADE on six in-domain datasets and evaluating generalization and few-shot transfer learning on six out-of-domain datasets. The resulting system outperforms single- and multi-dataset models in terms of in-domain accuracy, and we find that a simple approach to transfer learning works well: averaging the parameters of the MADE adapters results in a single model that gets better zero-shot generalization and few-shot transfer performance compared to both baselines as well as a state-of-the-art multi-dataset QA model, UnifiedQA (Khashabi et al., 2020). Our experiments illustrate the benefits of modeling both cross-dataset regularities and dataset-specific attributes, and the trained models offer a strong and versatile starting point for new question-answering models.

Related Work
Prior work has addressed multi-dataset reading comprehension by fine-tuning a pre-trained Transformer language model simultaneously on examples from multiple datasets (Talmor and Berant, 2019; Fisch et al., 2019). Several works explore different multi-task sampling schedules as a way of mitigating training set imbalances (Xu et al., 2019; Gottumukkala et al., 2020). Another line of work focuses on training models to answer a wider variety of question types, including UnifiedQA (Khashabi et al., 2020), a T5 model (Raffel et al., 2020) trained on datasets with different answer formats, such as yes/no and multiple-choice, using a unified text-to-text format. Adapters (Houlsby et al., 2019; Rebuffi et al., 2018) are task-specific modules interleaved between the layers of a shared Transformer. Stickland and Murray (2019) trained task adapters and the Transformer parameters jointly for the GLUE benchmark (Wang et al., 2019) but achieved mixed results, improving on small datasets but degrading on larger ones. Subsequent work has used a frozen, pre-trained Transformer and trained task adapters separately. Researchers have explored different methods for achieving transfer learning in this setting, such as learning to interpolate the activations of pre-trained adapters (Pfeiffer et al., 2021).

Problem Definition
The objective of reading comprehension is to model the distribution p(a | q, c), where q, c, a ∈ V* represent a question, supporting context, and answer respectively and consist of sequences of tokens from a vocabulary V. For simplicity, we focus on extractive reading comprehension, where every question can be answered by selecting a span of tokens in the context, but the approach is generic and can be extended to other formats. We make the standard assumption that the probability of context span c_{i...j} being the answer decomposes into the product of p(start = i | q, c) and p(end = j | q, c).
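Under this decomposition, the probability of every candidate span can be computed from two token-level distributions. A minimal numpy sketch (the logit inputs are hypothetical; a real model would produce them from the encoder):

```python
import numpy as np

def span_probabilities(start_logits, end_logits):
    """Compute p(answer = c[i..j]) = p(start=i) * p(end=j) for all spans,
    assuming independent start/end distributions over context tokens.
    Illustrative helper, not the paper's implementation."""
    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()

    p_start = softmax(np.asarray(start_logits, dtype=float))
    p_end = softmax(np.asarray(end_logits, dtype=float))
    # Outer product gives the probability of every (start, end) pair;
    # keep only valid spans with end >= start.
    return np.triu(np.outer(p_start, p_end))
```

The most likely span is then simply the argmax over the resulting matrix.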
We consider a collection of source datasets S and target datasets T , where each dataset D ∈ S ∪ T consists of supervised examples in the form (q, c, a). The goal is to train a model on S that achieves high in-domain accuracy and transfers well to unseen datasets in T , either zero-shot or given a small number of labeled examples.

Multi-dataset Fine-tuning
The standard approach to multi-dataset reading comprehension is to fit a single model to examples drawn uniformly from the datasets in S:

max_{θ,ψ} E_{D∼S} E_{(q,c,a)∼D} [log p_{θ,ψ}(a | q, c)],

where θ refers to the parameters of an encoder model (usually a pre-trained Transformer like BERT; Devlin et al., 2019), which maps a question and context to a sequence of contextualized token embeddings, and ψ denotes the classifier weights used to predict the start and end tokens. The objective is approximated by training on mixed mini-batches with approximately equal numbers of examples from each dataset (Fisch et al., 2019; Khashabi et al., 2020), although some researchers have investigated more sophisticated sampling strategies (Xu et al., 2019). For example, Gottumukkala et al. (2020) introduce dynamic sampling, sampling from each dataset in inverse proportion to the model's current validation accuracy.
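The mixed mini-batch approximation can be sketched in a few lines of plain Python; each batch slot first picks a dataset uniformly, then an example uniformly from that dataset (function name and data layout are hypothetical):

```python
import random

def mixed_minibatches(datasets, batch_size, num_batches, seed=0):
    """Yield mini-batches with approximately equal representation of each
    source dataset: every slot picks a dataset uniformly at random, then
    an example uniformly from it. `datasets` maps name -> list of examples."""
    rng = random.Random(seed)
    names = sorted(datasets)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            name = rng.choice(names)
            batch.append((name, rng.choice(datasets[name])))
        yield batch
```

In expectation, each dataset contributes batch_size / |S| examples per batch regardless of its size, which is what makes this a uniform-over-datasets (rather than uniform-over-examples) objective.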

MADE
Our approach is to explicitly model the fact that our data represent a mixture of datasets. We decompose the model parameters into a shared Transformer, θ, and dataset-specific token classifiers ψ = (ψ_1, ..., ψ_|S|) and adapters φ = (φ_1, ..., φ_|S|) (Figure 1). We use a two-phase optimization procedure to fit these parameters. In the joint optimization phase, we jointly train all of the parameters on the source datasets:

max_{θ,φ,ψ} Σ_{i=1}^{|S|} E_{(q,c,a)∼D_i} [log p_{θ,φ_i,ψ_i}(a | q, c)].

After validation accuracy (the average F1 score over the source datasets) stops improving, we freeze θ and continue with adapter tuning, refining each pair (φ_i, ψ_i) separately on its dataset.
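Operationally, the two phases differ only in which parameter groups receive gradient updates. A schematic sketch of that selection logic (the `select_trainable` helper and the "theta"/"phi"/"psi" naming scheme are hypothetical, not from the paper's code):

```python
def select_trainable(params, phase, dataset=None):
    """Return the names of parameter groups updated in each MADE phase.

    `params` maps names like "theta", "phi/squad", "psi/squad" to tensors.
    Phase 1 ("joint"): update the shared Transformer theta together with
    every dataset's adapter phi_i and classifier psi_i.
    Phase 2 ("adapter_tuning"): freeze theta and tune only the one
    dataset's (phi_i, psi_i) pair."""
    if phase == "joint":
        return sorted(params)
    if phase == "adapter_tuning":
        return sorted(k for k in params
                      if k in (f"phi/{dataset}", f"psi/{dataset}"))
    raise ValueError(f"unknown phase: {phase}")
```

In a PyTorch training loop, the returned names would determine which tensors have requires_grad set and which are passed to the optimizer.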
Zero-shot generalization We use a simple strategy to extend MADE to an unseen dataset: we initialize a new adapter and classifier (φ′, ψ′) by averaging the parameters of the pre-trained adapters and classifiers φ_1, ..., φ_|S| and ψ_1, ..., ψ_|S|, and return the answer with the highest probability under p_{θ,φ′,ψ′}(a | q, c). We also considered an ensemble approach, averaging the token-level probabilities predicted by each adapter, but found this to perform similarly to parameter averaging at the additional cost of running the model |S| times.
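The zero-shot initialization is a uniform average of the pre-trained adapters' parameters. A minimal plain-Python sketch, operating on state dicts represented as lists of floats (a real implementation would average tensors key by key):

```python
def average_adapters(state_dicts):
    """Uniformly average a list of adapter state dicts (name -> list of
    floats), producing the zero-shot initialization described above.
    Illustrative sketch; real adapters store tensors, not lists."""
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: [sum(sd[k][i] for sd in state_dicts) / n
                for i in range(len(state_dicts[0][k]))]
            for k in keys}
```

Compared to the probability-level ensemble, this costs one forward pass instead of |S|, which is why the paper prefers it.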
Transfer learning We also consider a transfer learning setting, in which a small number of labeled examples of a target domain (denoted D_tgt) are provided. We explore two ways to build a single, more accurate model. The first is to initialize (φ′, ψ′) as a weighted average of the pre-trained adapters and classifiers, using D_tgt to estimate the mixture weights. For each i, we set the mixture weight α_i to be proportional to the exponential of the negative zero-shot loss on the training data:

α_i ∝ exp(−L_i(D_tgt)), where L_i(D_tgt) = −E_{(q,c,a)∼D_tgt} [log p_{θ,φ_i,ψ_i}(a | q, c)],

and then tune θ and (φ′, ψ′) on the target dataset. The second approach is to first jointly tune θ, φ, and ψ on D_tgt, maximizing the marginal likelihood:

max_{θ,φ,ψ} E_{(q,c,a)∼D_tgt} [log (1/|S|) Σ_{i=1}^{|S|} p_{θ,φ_i,ψ_i}(a | q, c)],

and then take the weighted average of the parameters, calculating the mixture weights α_i as above but using the loss of the fine-tuned adapters on a small number of held-out examples from D_tgt. Pre-averaging is faster to train, because it only involves training one model rather than all |S| adapters. After training, both approaches result in a single model that only requires running one forward pass through (θ, φ′, ψ′) to make a prediction.
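The weighting α_i ∝ exp(−L_i) is a softmax over the negative per-adapter losses, followed by a convex combination of parameters. A plain-Python sketch (state dicts as lists of floats, as a simplification):

```python
import math

def mixture_weights(losses):
    """alpha_i proportional to exp(-loss_i): a softmax over negative
    per-adapter losses on the target data. Sketch of the weighting
    described above."""
    exps = [math.exp(-l) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_average(state_dicts, alphas):
    """Convex combination of adapter state dicts with mixture weights."""
    keys = state_dicts[0].keys()
    return {k: [sum(a * sd[k][i] for a, sd in zip(alphas, state_dicts))
                for i in range(len(state_dicts[0][k]))]
            for k in keys}
```

Adapters with low loss on the target data dominate the average, so poorly matched source experts are automatically down-weighted.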

Setup
We use the datasets from the MRQA 2019 shared task (Fisch et al., 2019), which are split into six large in-domain datasets and six small out-of-domain datasets. Dataset statistics are in Appendix A.1. We use the RoBERTa-base model (Liu et al., 2019) with the default adapter configuration from Houlsby et al. (2019), which adds approximately 1.8M parameters to the ≈128M in RoBERTa-base (1%).

[Figure 2: Single-dataset adapters / zero-shot F1]

In-domain Performance
First we train MADE on the six training datasets and compare in-domain accuracy with single- and multi-dataset fine-tuning and standard adapter training (freezing the Transformer parameters). For context, we also compare with a method from recent work, dynamic sampling (Gottumukkala et al., 2020), by sampling from each dataset in proportion to the difference between the current validation accuracy (EM+F1) on that dataset and the best accuracy from single-dataset training. We train all models by sampling up to 75k training and 1k development examples from each dataset, following Fisch et al. (2019). More details are in Appendix A.2.

Table 1 shows that MADE scores higher than both single- and multi-dataset baselines. Both phases of MADE training (joint optimization followed by separate adapter tuning) are important for getting high accuracy. Jointly optimizing the underlying MADE Transformer improves performance compared to single-dataset adapters, suggesting that joint training encodes some useful cross-dataset information in the Transformer model. Adapter tuning is important because the multi-dataset model converges on different datasets at different times, making it hard to find a single checkpoint that maximizes performance on all datasets (see Appendix Figure 3 for the training curve). Some of the improvements can also be attributed to the adapter architecture itself, which slightly outperforms fine-tuning on most datasets. Dynamic sampling does not improve results, possibly because the datasets are already balanced in size.

Zero-shot Generalization

Table 2 shows the results of applying this model to an unseen dataset (zero-shot). We compare a simple method for using MADE, averaging the parameters of the different adapters, against the multi-dataset model from Section 4.2, against averaging the parameters of single-dataset adapters, and against the pre-trained UnifiedQA-base (Khashabi et al., 2020). We compare MADE with and without the second phase of separate adapter tuning.
Surprisingly, averaging the parameters of the different MADE adapters results in a good model, generalizing better on average compared to both multi-dataset models. The second phase of adapter tuning improves these results. Parameter averaging performs poorly for single-dataset adapters, possibly because the separately trained adapters are too different from each other to interpolate well. Figure 2 compares the zero-shot accuracy obtained by the different MADE and single-dataset adapters. The two sets of adapters show similar patterns, with some adapters generalizing better than others, depending on the target, but all of the MADE adapters generalize better than the corresponding single-dataset adapters. This performance gap is considerably bigger than the gap in in-domain performance (Table 1), further illustrating the benefit of joint optimization.

Transfer Learning
Finally, we compare two ways of using MADE for transfer learning: either averaging the adapter parameters and then fine-tuning the resulting model (pre avg.), or first fine-tuning all of the adapters and then taking the weighted average (post avg.). In both cases, we also back-propagate through the Transformer parameters. We reserve 400 examples from each target dataset to use as a test set (following Ram et al., 2021) and sample training datasets of different sizes, using half of the sampled examples for training and the other half as validation data for early stopping and to set the mixture weights for averaging the adapter parameters.

(UnifiedQA was trained on different datasets with a different architecture, but represents an alternative off-the-shelf model for QA transfer learning. We compare to UnifiedQA-base because the encoder has approximately the same number of parameters as RoBERTa-base.)
The results are in Table 3. On average, MADE leads to higher accuracy compared to the baselines, with bigger improvements for the smaller dataset sizes, showing that a collection of robust single-dataset experts is a good starting point for transfer learning. The post-average method performs about the same as averaging at initialization in the lower-data settings, and better with K = 256. All models struggle to learn with only 16 examples, and on DuoRC, which has long contexts and distant supervision and might represent a more challenging target for few-shot learning. We also experimented with single-dataset adapters and with a frozen Transformer, both of which perform worse; detailed results are in Appendix B.2.

Conclusion
MADE combines the benefits of single- and multi-dataset training, resulting in better in-domain accuracy, generalization, and transfer performance than either multi-dataset models or single-dataset models, especially in low-resource settings. For future work we plan to explore explicit mixture-modeling approaches for better zero-shot prediction and transfer learning.

A.1 Dataset Details
We use the pre-processed datasets from the MRQA 2019 shared task (Fisch et al., 2019). Table 4 provides some dataset statistics.

A.2 Training Details
Our models are implemented in PyTorch (Paszke et al., 2019) using HuggingFace (Wolf et al., 2020) and the adapter-transformers library (Pfeiffer et al., 2020). For all in-domain experiments, we sample 75,000 training and 1,000 validation examples and train with a constant learning rate and a batch size of 8, taking checkpoints every 1024 steps and stopping if validation F1 fails to improve for 10 checkpoints, up to a fixed maximum number of epochs (10 for single-dataset training and 3 for multi-dataset training). We use a constant learning rate of 1e-5 for Transformer parameters and 1e-4 for adapter parameters, following standard settings for RoBERTa and adapters respectively (Liu et al., 2019; Houlsby et al., 2019), and use the AdamW optimizer (Loshchilov and Hutter, 2018) with the HuggingFace default parameters. For the multi-dataset models, we construct mini-batches of size B by repeating B times: pick a dataset uniformly, and pick an example uniformly from that dataset. We train all models on single 2080Ti GPUs with 11GB of memory each. The multi-dataset models take around two days to train, the single-dataset models take less than 24 hours, and it takes about 2 hours to train one model sequentially on six transfer datasets for three values of K and three seeds.
Distant supervision Some datasets provide the gold answer string but do not mark the gold answer span in the context. We train the model to maximize the marginal likelihood of the gold answer string, marginalizing over all occurrences in the context. The set of possible answer spans is annotated in the pre-processed MRQA datasets.
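Marginalizing over occurrences amounts to summing the probability mass assigned to every annotated gold span. A small sketch (the dict-of-spans representation is hypothetical; a real implementation would use a log-sum-exp over logits for numerical stability):

```python
import math

def marginal_log_likelihood(span_log_probs, gold_spans):
    """Marginal log-likelihood of a distantly supervised answer:
    log sum over all gold occurrences of p(span = (i, j)).
    `span_log_probs` maps (start, end) to a log-probability (sketch)."""
    return math.log(sum(math.exp(span_log_probs[s]) for s in gold_spans))
```

With a single gold span this reduces to the ordinary log-likelihood; with several occurrences the model is free to put its mass on any of them.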
Long contexts For inputs that are longer than the maximum input window for RoBERTa (512 tokens), we use a sliding window to split the input into multiple "chunks": every input begins with the full question and the [cls] and separator tokens, and we fill the rest of the input window with tokens from the context, sliding the window by 128 tokens with each stride. At prediction time, we return the answer from the chunk that has the highest predicted probability.
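The chunking logic can be sketched as follows (a simplification: the [cls] and separator tokens counted in the real budget are ignored here, and the function name is hypothetical):

```python
def make_chunks(question_tokens, context_tokens, max_len=512, stride=128):
    """Split a long (question, context) pair into overlapping chunks.
    Each chunk repeats the full question and fills the remaining token
    budget with a window of the context, advancing by `stride` tokens.
    Sketch; ignores special tokens counted in the real input budget."""
    budget = max_len - len(question_tokens)
    chunks = []
    start = 0
    while True:
        chunks.append((question_tokens, context_tokens[start:start + budget]))
        if start + budget >= len(context_tokens):
            break  # this window already reaches the end of the context
        start += stride
    return chunks
```

Because consecutive windows overlap by budget − stride tokens, any answer span shorter than the overlap is guaranteed to appear whole in at least one chunk.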
Negative examples We follow Longpre et al. (2019) and include "negative examples" during training. If a context chunk does not contain the answer span, we include the example as a training instance and train the model to indicate that the example does not contain the answer by selecting the [cls] token as the most likely start and end of the span. At prediction time, we discard "no answer" predictions and return the non-empty answer from the chunk that has the highest predicted probability. For UnifiedQA, we train the model to predict an empty string for contexts that do not contain the answer, and at prediction time we return the non-empty answer with the highest probability.
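The prediction-time selection across chunks is a simple filter-then-max. A sketch (the (answer, probability) representation is hypothetical; `None` stands in for a [cls] "no answer" prediction):

```python
def best_answer(chunk_predictions):
    """Pick the highest-probability non-empty answer across chunks,
    discarding "no answer" predictions, as described above.
    `chunk_predictions` is a list of (answer_text, probability) pairs,
    where answer_text is None for a no-answer prediction."""
    scored = [(p, a) for a, p in chunk_predictions if a is not None]
    if not scored:
        return None  # every chunk predicted "no answer"
    return max(scored)[1]
```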

A.3 Transfer learning details
For transfer learning, we take 1/2 of the K training examples for validation and train for 200 steps or until the validation loss fails to improve for 10 epochs, and we reduce the adapter learning rate to 1e-5. The other hyper-parameters are the same as for in-domain learning.
Training UnifiedQA We download the pre-trained UnifiedQA-base model from HuggingFace and train it in the format described in Khashabi et al. (2020) and in the accompanying code release. 5 We lower-case the question and context strings and concatenate them with a special string "\n", which represents the backslash character followed by the letter n, and train the model to generate the answer string by minimizing cross-entropy loss. We use greedy decoding for prediction. In our pilot experiments, the recommended optimizer (Adafactor with a learning rate of 1e-3) quickly over-fits, so we use the same optimizer, learning rate, and batch size as for RoBERTa.

B.1 Training Curve

Figure 3 shows the training curve for MADE, normalized by dividing each checkpoint score by the maximum validation accuracy obtained on that dataset during the run. The model reaches its maximum performance on the "easy" datasets early in training, which means that the model might over-fit to those datasets before converging on the more difficult datasets. MADE avoids this problem by tuning the adapter parameters separately after joint optimization. Interestingly, adapter tuning leads to improved performance on all datasets (Table 1), even datasets on which joint optimization appears to have already converged.

B.2 Additional Transfer Results

Table 5 provides additional transfer learning results. Single-dataset adapters transfer worse than MADE, although performance improves considerably compared to zero-shot performance (Table 2). We observe that the transfer process heavily down-weights some single-dataset adapters (like TriviaQA and SearchQA) that get high loss either before or after training, which might explain the performance improvement. Freezing the Transformer parameters slightly improves results in the K = 16 setting but leads to worse performance with more data. The biggest drop is on BioASQ, possibly because it introduces new vocabulary and it is beneficial to update the token embeddings.
Table 5: Transfer learning to MRQA out-of-domain datasets with K training examples (F1, averaged over three random seeds), using the MADE model with adapter-tuning. † : RACE is part of the UnifiedQA training data. pre avg.: we take the weighted average of adapters at initialization before fine-tuning; post avg.: we fine-tune each adapter jointly and average them at the end. Freeze θ refers to experiments where we freeze the Transformer parameters rather than tuning them along with the adapters.