Beyond Reptile: Meta-Learned Dot-Product Maximization between Gradients for Improved Single-Task Regularization

Meta-learning algorithms such as MAML, Reptile, and FOMAML have led to improved performance of several neural models. The primary difference between standard gradient descent and these meta-learning approaches is that their updates contain, as a small component, the gradient for maximizing the dot-product between the gradients of different batches, leading to improved generalization. Previous work has shown that aligned gradients are related to generalization, and has also used the Reptile algorithm in a single-task setting to improve generalization. Inspired by these approaches, this paper proposes to use a first-order finite-differences algorithm to calculate this gradient of the dot-product of gradients, allowing explicit control over the weight of this component relative to the standard gradient. We use this gradient as a regularization technique, leading to more aligned gradients between different batches. By using the finite-differences approximation, our approach does not suffer from the O(n^2) memory usage of naively calculating the Hessian and can easily be applied to large models with large batch sizes. Our approach achieves state-of-the-art performance on the Gigaword dataset, and shows performance improvements on several datasets such as SQuAD-v2.0, Quasar-T, NewsQA and all the SuperGLUE datasets, with a range of models such as BERT, RoBERTa and ELECTRA. Our method also outperforms the previous Reptile and FOMAML approaches when used as a regularization technique, in both single and multi-task settings. Our method is model-agnostic and introduces no extra trainable weights.


Introduction
Meta-learning algorithms such as MAML (Finn et al., 2017), FOMAML, and Reptile (Nichol et al., 2018), which modify gradient descent by effectively differentiating through it, have led to performance improvements on several datasets such as MiniImageNet (Vinyals et al., 2016), Omniglot (Lake et al., 2011) and the Java Github Corpus (Allamanis and Sutton, 2013). When used in a single-task setting, the only significant difference between these algorithms and standard SGD is that their meta-gradient also contains, as a component, the gradient for maximizing the dot-product between the gradients of examples/batches, as was theoretically shown in the Reptile paper.
The Reptile algorithm has also been leveraged to improve single-task performance across a range of models and tasks, such as in Kedia and Chinthakindi (2021). Second-order methods for aligned gradients have been explored before in the context of continual learning, such as in Riemer et al. (2019), Lopez-Paz and Ranzato (2017), Chaudhry et al. (2018). Some recent work, such as Fort et al. (2019), Chatterjee (2020), and Yu et al. (2020) have also shown that aligned gradients are related to improved generalization and model performances. We conjecture that aligned gradients in single-task settings will also improve learning across examples, enabling better transfer from one example to another, similar to as often done in continual/multi-task approaches such as Riemer et al. (2019).
However, a naive approach to directly maximize the dot-product of gradients requires calculating the Hessian matrix, whose memory usage scales as O(n^2), where n is the number of model parameters. This approach also fails to work with gradient accumulation, leading to a hard limit on the batch size and reducing training accuracy. Even though some recent works such as Anil et al. (2020) have tried to make this tractable using large distributed environments, the computation costs are extremely high for any reasonably large model. Approaches like the Hessian-vector product (Pearlmutter, 1994) also do not work with gradient accumulation. Inspired by the above approaches and to fix the aforementioned issues, we propose to explicitly calculate the gradient for maximizing the dot product between batches using a first-order approximation. We use this gradient as a regularizing term, and show that it results in improved performance across a range of models and tasks. Our main contributions are:
• Use the first-order finite-differences method (Smith et al., 1985) to explicitly calculate the gradient from the dot-product of gradients.
• Utilize the above gradient to regularize the training of models in single-task settings.
• Achieve significant performance improvements across a wide range of tasks and datasets such as SQuAD-v2.0, Quasar-T and all SuperGLUE datasets, including state-of-the-art performance on Gigaword.
• Outperform previous approaches such as Reptile and FOMAML in single-task as well as multi-task settings.
• Remain model agnostic, with no extra trainable weights.
• Improve performance across a range of model sizes and pre-training schemes, such as BERT, ELECTRA, RoBERTa, and for small, base and large models.
Proposed Method

Background on Reptile and MAML Algorithms

MAML The MAML algorithm, initially intended for multi-task few-shot learning, proposed to do k steps of "inner" gradient updates, after which the loss is computed and minimized on the (k + 1)-th batch with respect to the original weights before the k inner steps. The gradient from this loss is then used for an "outer" update to the original weights. This requires differentiating through the optimizer, and is a second-order method.
FOMAML The authors of MAML also proposed FOMAML, which is a first order approximation of MAML bypassing the differentiation through the optimizer. This method also achieves significant improvement compared to vanilla learning algorithms.
Reptile The Reptile algorithm is similar to FOMAML, and also does k inner steps of gradient updates. For the outer update, Reptile uses the difference between the original weights and the inner weights as the gradient.

Figure 1: Our proposed algorithm: calculating the gradient for maximizing the dot product using the finite-differences approximation, and using it to regularize the standard gradient.

The Reptile paper showed that the gradient for all three approaches is similar to vanilla SGD, except for a small component which maximizes the dot-product between batches:

G_meta ≈ G_avg + α G_inner

where G_avg is the expected SGD gradient from a batch, α is the inner-step size, and G_inner is the gradient for maximizing the dot-product between batches. The gradient is similar for MAML and FOMAML, differing only in the constants. However, this approximation is only valid for small α, which reduces the ability of G_inner to regularize the training. By computing G_inner explicitly, we aim to overcome this limitation.
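The inner/outer structure described above can be sketched as follows (a minimal numpy illustration, not the original implementation; `grad_fn` is a hypothetical function returning the gradient of the loss on one batch):

```python
import numpy as np

def reptile_step(theta, batches, grad_fn, alpha=0.01, outer_lr=0.1):
    # k inner SGD steps, one per batch
    phi = theta.copy()
    for b in batches:
        phi = phi - alpha * grad_fn(phi, b)
    # Reptile's outer update moves the original weights toward the adapted
    # weights; (theta - phi) plays the role of the meta-gradient
    return theta + outer_lr * (phi - theta)
```

For small `alpha`, Taylor-expanding the inner steps recovers the decomposition above: the average batch gradient plus a small dot-product-maximizing component.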

Our Approach: MetaDotProd
Our proposed regularization scheme is inspired by the inner loop of the Reptile algorithm, and uses the finite-differences method to approximate the Hessian-vector product. Algorithm 1 shows how to calculate the gradient for maximizing the dot product of gradients using the finite-differences method applied to an SGD optimizer. Essentially, we calculate the gradients G1 and G2 from batches b1 and b2, and then temporarily update the network parameters with α * G1. We then recompute the gradient on batch b2 with the new network parameters to obtain G2,1, and use it to calculate the gradient of the dot product G1·G2. Once this gradient is calculated, unlike in Reptile and FOMAML, we can explicitly control its relative weight by adjusting the hyperparameter LR_G1G2.
The last line of the algorithm is the SGD update, which can be substituted with any other optimizer. In our experiments, we use each original model's default optimizer; these range from SGD to Adam (Kingma and Ba, 2015) to AdaFactor (Shazeer and Stern, 2018).
Gradient accumulation, if any, can be stored in G1, G2 and G2,1 before applying this algorithm. The compute and storage needed for our method scale linearly with model size (an overhead of approx. 50% in compute), allowing us to apply it to large models such as BERT with significantly smaller overhead than calculating the Hessian. This overhead can be reduced to 10% without significantly impacting our method, as we show in subsection 6.3.
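One full update of this scheme can be sketched as follows (a simplified single-tensor numpy version under our reading of Algorithm 1; `grad_fn` and the plain SGD outer step are illustrative stand-ins for the model's gradient computation and its default optimizer):

```python
import numpy as np

def meta_dotprod_step(theta, b1, b2, grad_fn, lr=0.1, alpha=1e-7, lr_dot=0.1):
    g1 = grad_fn(theta, b1)                    # G1
    g2 = grad_fn(theta, b2)                    # G2
    # Temporary inner step with alpha * G1, then re-evaluate batch b2
    g21 = grad_fn(theta - alpha * g1, b2)      # G2,1
    # Finite-differences estimate of the gradient of G1.G2 (up to a constant)
    g_dot = (g2 - g21) / alpha
    # SGD update regularized to also *maximize* the dot product, with an
    # explicit relative weight lr_dot (LR_G1G2 in the paper)
    return theta - lr * ((g1 + g2) / 2.0 - lr_dot * g_dot)
```

Note that `lr_dot` is decoupled from `alpha`: the step size of the temporary update only controls the quality of the finite-differences approximation, while the weight of the regularizer is set independently.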

Theoretical Analysis
In this section, we provide a theoretical analysis of the meta update of our MetaDotProd algorithm. We generalize the Taylor-expansion approach used in Nichol et al. (2018), and show how our approach maximizes the inner product of gradients between different mini-batches. This analysis is essentially the expectation, over stochastic mini-batch sampling, of the finite-differences method for calculating the Hessian-vector product, but we present it here for clarity.
We consider two input batches b0 and b1 at the beginning of the i-th step. We define θ_i,0 as the network weights before the i-th step and, for j ∈ {0, 1}, L_{j+1} as the loss function corresponding to b_j. Then, our update rules are:

G1 = ∇L1(θ_i,0),    θ_i,1 = θ_i,0 - α G1    (1)
G2 = ∇L2(θ_i,0),    G2,1 = ∇L2(θ_i,1)    (2)

Using the first-order Taylor expansion of G2,1 around θ_i,0, we get:

G2,1 = G2 - α H2 G1 + O(α²)    (3)
G_dot = (G2 - G2,1) / α = H2 G1 + O(α)    (4)

For small α, we can ignore the terms involving O(α²) in (3). This term becomes O(α) in (4), but it is still a factor of α = 1e-7 smaller than H2 G1 and hence can be safely ignored.
Under the expectation of stochastic mini-batch sampling, E[H2 G1] = E[H1 G2], and equation (4) becomes:

E[G_dot] = E[H2 G1] = (1/2) ∇ E[G1 · G2]    (5)

giving exactly the gradient for maximizing the dot product between the gradients. Note that the above approximation relies on α being small enough: if α is too large, the approximation breaks down and the performance improvement decreases. This is particularly relevant as the relative weight of the G1·G2 component in the Reptile and FOMAML algorithms is directly proportional to this α, limiting the ability to adjust the importance of G1·G2, and hence limiting performance, as we will show in section 5.
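The finite-differences identity is easy to check numerically on a quadratic loss, where the Hessian is constant and the O(α) error vanishes (a self-contained sanity check, not code from the paper):

```python
import numpy as np

# For L2(theta) = 0.5 * theta^T A theta, the gradient is A @ theta and the
# Hessian H2 is the constant matrix A, so (G2 - G2,1)/alpha equals H2 @ G1.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
theta = np.array([0.5, -1.0])
g1 = np.array([2.0, 1.0])          # stand-in for the batch-b1 gradient
alpha = 1e-7

g2 = A @ theta                     # G2 at the original weights
g21 = A @ (theta - alpha * g1)     # G2,1 after the temporary inner step
g_dot = (g2 - g21) / alpha         # finite-differences estimate of H2 @ G1

print(np.allclose(g_dot, A @ g1, atol=1e-5))   # → True
```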

Datasets

We describe the dataset size and tasks for each of our datasets in Table 1 and Table 2, and give a short description below.

Gigaword Gigaword (See et al., 2017) is an English summarization dataset with single-line input documents from news sources; the task is to generate headlines.
Quasar-T An MRC retrieval dataset from Dhingra et al. (2017), consisting of open-domain trivia questions and their answers obtained from various internet sources. It is split into queries with smaller context documents (Quasar-T Short) and with longer context (Quasar-T Long). We only use the subset of the dataset in which the answer is an exact span.
Omniglot A dataset containing 20 hand-drawn samples of characters from 50 different alphabets, similar to the popular MNIST (Deng, 2012) dataset. Similar to Finn et al. (2017), we use the first 1200 classes as train and the others as test.
Mini-Imagenet A dataset containing 100 random classes from the ImageNet dataset (Deng et al., 2009), resized to 84x84 images, each class having 600 examples.

Models
We describe the model size and speeds of all our models in Table 3 and give a short description below.
BERT BERT (Devlin et al., 2019) is a transformer (Vaswani et al., 2017) model, and its derivatives and improvements are the backbone of most state-of-the-art models in NLP. We use the BERT-large-cased official implementation from Jiant (Wang et al., 2019) for SuperGLUE, and the official implementation of BERT-base-uncased and BERT-large-uncased from Rajpurkar et al. (2018) for SQuAD, and re-use the same for QUASAR-T.
RoBERTa RoBERTa (Liu et al., 2019) is a model with the same architecture as BERT, but with more carefully selected pre-training objectives and parameters, and larger pre-training data. We use the official checkpoints and hyper-parameters from Liu et al. (2019) for SQuAD, and re-use the same for Quasar. As the RoBERTa hyper-parameters for SuperGLUE are not available, we re-use the official BERT hyper-parameters.
ELECTRA Electra (Clark et al., 2020) pretrains a BERT-like transformer model to discriminate between real and fake input tokens generated by another smaller network. Models deriving from ELECTRA achieve state-of-the-art performance on a range of NLU tasks. We use the ELECTRA-small official implementation for SQuAD.
Pegasus A state-of-the-art model for summarization tasks, Pegasus (Zhang et al., 2020) has a standard encoder-decoder transformer architecture, but is pre-trained on the task of generating missing sentences. We use the official Pegasus model and parameters for Gigaword.
Conv-4 A convolutional network, with 4 blocks of conv2d, batch normalization and ReLU activation, followed by a dense layer with classification heads. We use the official model from Nichol et al. (2018) for all our experiments on Omniglot and Mini-ImageNet.

Implementation Details
We train each corresponding model 8 times on each dataset's training set (5 on SQuAD) and report the mean and standard error of these scores. We use one Nvidia V100 for all our experiments (8 for Gigaword). As our algorithm is first-order, it incurs only a linear compute overhead compared to the original model. All experiments run in less than a day, except for Gigaword, MultiRC, ReCoRD and Omniglot-20-way, which run in a few days.
As the test sets are hidden for SuperGLUE and SQuAD, we provide results on the dev set instead. We also provide dev set results on Quasar-T, since we use only the subset mentioned above. We only evaluate once on the Gigaword test set, and hence no standard error is provided.
Hyper-parameters Details of all default/official model hyper-parameters for each model/dataset can be found in their source code, links to which are available in the supplemental material. Wherever official hyper-parameters are not available, we have re-used hyper-parameters from other similar models/datasets, as described in subsection 4.2. Except in the ablation study for LR_G1G2, we use a fixed value of α of 1e-7 and LR_G1G2 of 0.1 for all our experiments. α was set to this value because it has to be small for the first-order approximation to hold. LR_G1G2 was chosen as 10% of the standard gradient so as to not overshadow the standard gradient for the task, while still providing enough gradient to maximize the dot product. We keep k for Reptile and FOMAML at 4.

Results on SuperGLUE datasets
As shown in Table 4, our method consistently improves the performance of both BERT and RoBERTa models on all SuperGLUE datasets. For the BERT model, we show performance gains of 2.5, 1.5, 2.7 and 0.7 in accuracy on BoolQ, CB, COPA and RTE, and 1.2 and 0.6 in EM on MultiRC and ReCoRD. With the RoBERTa model, we also observe significant performance improvements of 5 and 3 in accuracy on CB and RTE, and 1.5 in EM on MultiRC, with minor improvements on the other datasets.

Results on other datasets
As shown in Table 5, our method shows performance gains of 1.3 in F1 and 1.2 in EM on SQuAD with the BERT model. Our approach also achieves state-of-the-art performance on the Gigaword dataset, as shown in Table 6: applied to the baseline PEGASUS model, it results in improvements of 0.9 in ROUGE-1, 0.5 in ROUGE-2 and 0.1 in ROUGE-L. We also show performance improvements on the Quasar-T dataset, of 0.8 in F1 and 1.0 in EM on Quasar-T (long) compared to the baseline BERT model, and of 0.7 in F1 and 0.8 in EM on Quasar-T (short) compared to the baseline RoBERTa model, as shown in Table 7. Our method also improves the score of the BERT model on the NewsQA dataset by 0.5 in F1.

Results on varying model size and pre-training
To demonstrate the effect of varying model size as well as improved pre-training, in Table 9 we show the results of using DotProd on ELECTRA-small, BERT-base and RoBERTa-large on the SQuAD dataset. Our method improves performance across the entire range of models with varying pre-training strategies and sizes. Furthermore, our method is applicable to even larger models such as Pegasus, as shown previously, and to even smaller models such as Conv-4, as we will show in the next section.

Comparison to other Meta-Learning Methods - Few-Shot Multi-task Learning
While the focus of our approach is specifically on single-task learning, we also evaluate our method on few-shot multi-task learning on the Omniglot and Mini-Imagenet datasets, as done in the Reptile and MAML papers, to see the effectiveness of a higher G1·G2 weight compared to Reptile and MAML. As shown in Table 8, our method shows consistent improvements against Reptile and FOMAML on the Mini-Imagenet dataset, with a performance gain of 2.53 and 2.77 in 1-shot 5-way and 5-shot 5-way classification respectively against Reptile, and an improvement of 1.53 and 2.36 in 1-shot 5-way and 5-shot 5-way classification against FOMAML. We also observe consistent performance improvements of 2.19, 0.68, 1.81 and 0.87 against the Reptile approach on 1-shot 5-way, 5-shot 5-way, 1-shot 20-way and 5-shot 20-way classification respectively on the Omniglot dataset. When compared against the FOMAML approach, our method achieves a performance gain of 0.38 and 0.55 in 5-shot 5-way and 1-shot 20-way classification respectively on the Omniglot dataset. Note that the Reptile scores and our DotProd scores are without transduction, whereas the reported FOMAML scores are transductive, which boosts FOMAML's scores.

Comparison to other Meta-Learning Methods - Single Task

In Table 10, we compare our method against the Reptile algorithm and FOMAML on the SQuAD-v2.0 dataset with the BERT-large model. Note that the dot-product gradient component in Reptile and FOMAML is directly proportional to α, but α has to be small for the first-order approximation to hold, representing a direct conflict which limits the performance gains from these methods. Our method does not suffer from this limitation, improving performance.
Effect of LR_G1G2

In Figure 3, we compare the effect of different values of LR_G1G2 on the SQuAD-v2.0 dataset with the BERT-large model. The ability to select a higher weightage of G1·G2 is indeed effective, improving the performance of the model on both F1 and EM scores.

Effect of infrequent regularization
While Algorithm 1 is first-order, it introduces an overhead of 50% in computation. To minimize this overhead, instead of computing G1·G2 for every two batches, an alternative is to use standard gradient updates for most batches, and only apply this regularization infrequently, on a smaller number of batches. We study the effect of this on performance on the SQuAD dataset with the BERT-large model, by applying our method every 2, 3, 5, 8 and 10 batches, with overheads of 50%, 33%, 20%, 12%, and 10% respectively. Even with only 10% overhead, our regularization still results in significant performance gains of 1.0 F1 and 0.9 EM, with the improvement decreasing only slightly with reducing overhead.
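This schedule can be sketched as follows (illustrative numpy code under our reading of the method; `grad_fn` is a hypothetical per-batch gradient function, and `period` controls how often the regularized double-batch step is taken):

```python
import numpy as np

def train(theta, batches, grad_fn, lr=0.1, alpha=1e-7, lr_dot=0.1, period=2):
    i = 0
    while i < len(batches) - 1:
        if i % period == 0:
            # Regularized step: two batches, three gradient computations
            g1 = grad_fn(theta, batches[i])
            g2 = grad_fn(theta, batches[i + 1])
            g21 = grad_fn(theta - alpha * g1, batches[i + 1])
            g_dot = (g2 - g21) / alpha
            theta = theta - lr * ((g1 + g2) / 2.0 - lr_dot * g_dot)
            i += 2
        else:
            # Plain SGD step: one batch, one gradient computation
            theta = theta - lr * grad_fn(theta, batches[i])
            i += 1
    return theta
```

With `period=2` every pair of batches is regularized (one extra gradient per two batches, ~50% overhead), while `period=10` incurs roughly one extra gradient per ten batches (~10% overhead).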

Analysis of Gradient Dot Products
In Figure 4, we directly demonstrate that our method is indeed effective in boosting the dot-product between gradients while training. We compare two runs of BERT with the same seed on SQuAD-v2.0, and plot the dot-product between gradients of batches. To minimize the effect of noise, we smooth the plot using a moving window of 100 batches, and remove outlier points more than 10σ away from the mean. During training, while the dot-product naturally decreases to zero as the model converges (as also shown previously in Fort et al. (2019)), our approach significantly boosts the dot-product compared to the baseline, remaining consistently around 50-100% larger than the baseline throughout the training period.
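The smoothing described above can be sketched as follows (an illustrative reconstruction of the plotting pipeline, not the exact analysis code):

```python
import numpy as np

def smoothed_dots(grads, window=100, clip_sigma=10.0):
    # Dot products between gradients of consecutive batches
    dots = np.array([np.dot(a, b) for a, b in zip(grads[:-1], grads[1:])])
    # Drop outliers more than clip_sigma standard deviations from the mean
    kept = dots[np.abs(dots - dots.mean()) <= clip_sigma * dots.std()]
    # Moving-window average for a smoother curve
    return np.convolve(kept, np.ones(window) / window, mode="valid")
```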

Discussion of Effects on Training Dynamics
Training stability remains unaffected on all models/datasets we tried, and even improves slightly on Mini-Imagenet and Omniglot. Compared to our algorithm, the Reptile algorithm appears unstable, perhaps due to its larger α. Our DotProd method does not appear to make the model converge faster, with the rate of decrease of the loss remaining almost identical to the baseline, but it does converge to a slightly lower loss. While the value of LR_G1G2 was kept fixed at 0.1 in our experiments, model convergence remains unaffected up to a value of around 1.
Higher values of this hyper-parameter may be helpful depending on the dataset. For example, we observed that the scores on the QUASAR-Long dataset are even higher with LR_G1G2 set at 0.5, but we do not tune this parameter for different datasets in this paper. Also, while our algorithm is essentially the finite-differences method for calculating the Hessian-vector product, we use the one-sided rather than the centered version of finite differences to reduce the compute overhead of our method.

Transformer Models
Transformer models (Vaswani et al., 2017) are the backbone of most state-of-the-art NLP models. Models and pre-training techniques such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020) and Pegasus (Zhang et al., 2020) have led to large improvements in the performance of NLP models. See subsection 4.2 for a detailed discussion of these models.

Meta-learning
Several works have explored meta-learning approaches that directly modify the learning process, for example by differentiating through the optimizer, such as MAML (Finn et al., 2017), Reptile (Nichol et al., 2018), Andrychowicz et al. (2016), and Chen et al. (2017), giving performance improvements across a range of datasets and tasks. See subsection 2.1 for detailed descriptions of some meta-learning algorithms. While these approaches were initially proposed for few-shot multi-task learning, Kedia and Chinthakindi (2021) utilized the Reptile algorithm in single-task learning to improve generalization. Our approach is inspired by Reptile, but unlike Reptile, it gives us direct, explicit control over the importance of the gradients' dot product.

Aligned Gradients
Previous works have explored alignment of gradients in the fields of multi-task learning and continual learning, such as Riemer et al. (2019), Lopez-Paz and Ranzato (2017), and Chaudhry et al. (2018). Unlike these approaches, our method is first-order and does not require storing previously seen examples. Some recent works such as Fort et al. (2019) and Chatterjee (2020) also show that aligned gradients between examples are related to improved generalization and model performance.
PCGrad (Yu et al., 2020) proposes to mitigate conflicting gradients by projecting only conflicting gradients onto the normal plane, while leaving aligned gradients unmodified, achieving significant improvements in multi-task supervised and RL tasks on image datasets. Fort et al. (2019), in particular, proposed gradient alignment as a meta-learning direction for future work, which this paper explores.

Conclusion
We propose to use finite differences to calculate the gradient from the dot-product of gradients, and demonstrate its effectiveness as a regularization technique, leading to more aligned gradients between different batches. We leverage this approach to show performance improvements on several datasets such as SQuAD-v2.0, Quasar-T, and all the SuperGLUE datasets, achieving state-of-the-art performance on Gigaword. Our method is effective over a range of models and model sizes, such as BERT, RoBERTa and ELECTRA. Our method outperforms the Reptile and FOMAML algorithms in single-task and few-shot multi-task settings, is first-order, is model-agnostic, and can be used with large models and large batches.

A Comparison to Previously Published Scores
A.1 SQuAD

BERT The BERT paper reports the scores of the best-performing model. The mean score is a more robust measure, given the significant variation when fine-tuning BERT models, and hence we choose to report mean scores. Below are the results of our best-performing model.

Method                        F1     EM
BERT (Devlin et al., 2019)    81.9   78.7
+DotProd                      83.16  80.28

RoBERTa The RoBERTa paper reports a score of 86.5 EM and 89.4 F1 on SQuAD. Using the official parameters listed in the RoBERTa paper and their official checkpoint, while using the official SQuAD implementation of BERT, none of our 5 runs crossed an 86.0 EM score. This is likely due to differences in handling unanswerable questions, long-sequence-length documents, etc. We report our reproduced scores in the main paper using the official SQuAD implementation of BERT with RoBERTa hyper-parameters.
ELECTRA The ELECTRA paper does not report SQuAD scores for ELECTRA-small. We report the reproduced scores using the official ELECTRA GitHub source code with default hyper-parameters. Note that the ELECTRA GitHub repository reports a median score of 70.1 EM for ELECTRA-small, but none of our runs reached this performance, even across 15 runs using the fully official code and checkpoints.

A.2 SuperGLUE
Below we compare our reproduced BERT SuperGLUE scores to scores published in previous work.

A.3 Quasar-T
We only use the subset of the dataset in which the answer is an exact span, as mentioned in our main paper. As this is a non-standard subset, we report our reproduced scores.

A.4 Gigaword
We report the official scores from the Pegasus GitHub repository for the "Mixed & Stochastic" model as our baseline. Note that these GitHub scores are higher than those reported in the Pegasus paper.

A.5 Omniglot and MiniImageNet
We report the official scores from the Reptile and MAML papers.

B Links to Source code
For SuperGLUE, we use the official implementation for BERT and RoBERTa available at https://github.com/nyu-mll/jiant, along with the default pre-trained models.
For SQuAD, QUASAR and NewsQA, we used the official implementation and pre-trained models at https://github.com/google-research/bert for BERT, and the official pre-trained models from https://github.com/pytorch/fairseq/tree/master/examples/roberta for RoBERTa.
For Pegasus, we used the official implementation and "Mixed & Stochastic" pre-trained model weights at https://github.com/google-research/pegasus. For Omniglot and Mini-Imagenet, we used the official Reptile code at https://github.com/openai/supervised-reptile. The DotProd optimizer is trivial to implement in all of the above models following the pseudocode from the main paper, by modifying the Optimizer class used for each model.

D Evaluation Metric code
We used the original evaluation metrics code for all our models, available from the source code and datasets linked above.

E Dataset Details and Evaluation

E.1 SuperGLUE

COPA Choice of Plausible Alternatives, a dataset to classify the cause/effect of a given premise from two alternatives, with fully handcrafted data.

MultiRC Multi-Sentence Reading Comprehension, a QA dataset with a list of multiple-choice possible answers for each question about a paragraph. Evaluated with F1 over all answer options (F1a), and exact match of each question's set of answers (EM).
ReCoRD Reading Comprehension with Commonsense Reasoning, a QA dataset consisting of articles and cloze-style questions with a masked entity, where the task is to predict the masked entity from the entities in the article, with data from CNN and Daily Mail. Scored with token-level F1 and EM.
RTE Recognizing Textual Entailment, as binary classification of entailment or not entailment, with data from Wikipedia and news.
WiC Word-in-Context, a word sense disambiguation (WSD) dataset, tasked with binary classification of sentence pairs based on the sense of a common polysemous word. Data is from WordNet and Wiktionary.
WSC Winograd Schema Challenge, a coreference resolution task on resolving pronouns to a list of noun phrases. As the models we tested only predicted the majority class, we omit this dataset.

E.2 SQuAD v2.0
The Stanford Question Answering Dataset v2.0 is a popular span-style QA dataset, consisting of passages from Wikipedia, labelled by annotators with questions on the passages and corresponding answer spans, along with unanswerable questions. The dataset is evaluated with F1 and EM scores of the predicted answer spans.
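The span-level F1 can be sketched as token overlap between prediction and gold answer (a simplified version; the official SQuAD script additionally normalizes case, articles and punctuation):

```python
from collections import Counter

def token_f1(prediction, gold):
    pred, ref = prediction.split(), gold.split()
    common = Counter(pred) & Counter(ref)     # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match (EM) is simply whether the normalized prediction equals the gold answer string.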

E.3 GigaWord
Gigaword is a summarization dataset with single-line input documents from news sources; the task is to generate headlines. The dataset is pre-tokenized, and numbers are replaced with #. Evaluation uses the ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) metrics.

E.4 Quasar-T
QUASAR-T is a large-scale dataset aimed at evaluating systems designed to comprehend a natural language query and extract its answer from a large corpus of text. It consists of open-domain trivia questions and their answers obtained from various internet sources. We only use those questions whose answers can be extracted as a span for our training and evaluation.

F Label Distributions for datasets
SuperGLUE datasets: the baseline scores from always predicting the most frequent class are 62.3 accuracy for BoolQ, 21.7/48.4 Avg. F1/accuracy for CB, 50.0 accuracy for COPA, 61.1/0.3 F1a/EM for MultiRC, 33.4/32.5 F1/accuracy for ReCoRD, 50.3 accuracy for RTE, and 50.0 accuracy for WiC.

SQuAD v2.0 train set has a total of 130,319 questions of which 43,498 are unanswerable, whereas the dev set has a total of 11,873 questions of which 5,945 are unanswerable. The answer-span location varies across the input.
NewsQA train set has a total of 97,313 questions of which 20,753 are unanswerable, whereas the dev set has a total of 5,456 questions of which 1,115 are unanswerable. The answer-span location varies across the input.
Quasar-T Long train set has 24,499 questions whereas the dev set contains 1,920 questions. The answer-span location varies across the input.
Quasar-T Short train set has 20,533 questions whereas the dev set contains 1,653 questions. The answer-span location varies across the input.