Few-Shot Question Answering by Pretraining Span Selection

In several question answering benchmarks, pretrained models have reached human parity through fine-tuning on the order of 100,000 annotated questions and answers. We explore the more realistic few-shot setting, where only a few hundred training examples are available, and observe that standard models perform poorly, highlighting the discrepancy between current pretraining objectives and question answering. We propose a new pretraining scheme tailored for question answering: recurring span selection. Given a passage with multiple sets of recurring spans, we mask all recurring spans in each set but one, and ask the model to select the correct span in the passage for each masked span. Masked spans are replaced with a special token, viewed as a question representation, that is later used during fine-tuning to select the answer span. The resulting model obtains surprisingly good results on multiple benchmarks (e.g., 72.7 F1 on SQuAD with only 128 training examples), while maintaining competitive performance in the high-resource setting.


Introduction
The standard approach to question answering is to pretrain a masked language model on raw text, and then fine-tune it with a span selection layer on top (Devlin et al., 2019; Joshi et al., 2020; Liu et al., 2019). While this approach is effective, and sometimes exceeds human performance, its success is based on the assumption that large quantities of annotated question answering examples are available. For instance, both SQuAD (Rajpurkar et al., 2016, 2018) and Natural Questions (Kwiatkowski et al., 2019) contain on the order of 100,000 question and answer pairs in their training data. This assumption quickly becomes unrealistic as we venture outside the lab conditions of English Wikipedia, and attempt to crowdsource question-answer pairs in other languages or domains of expertise (Tsatsaronis et al., 2015; Kembhavi et al., 2017). How do question answering models fare in the more practical case, where an in-house annotation effort can only produce a couple hundred training examples?

* Equal contribution. Our code, models, and datasets are publicly available: https://github.com/oriram/splinter.
We investigate the task of few-shot question answering by sampling small training sets from existing question answering benchmarks. Despite the use of pretrained models, the standard approach yields poor results when fine-tuning on few examples (Figure 1). For example, RoBERTa-base fine-tuned on 128 question-answer pairs from SQuAD obtains around 40 F1. This is somewhat expected, since the pretraining objective is quite different from the fine-tuning task. While masked language modeling requires mainly local context around the masked token, question answering needs to align the question with the global context of the passage. To bridge this gap, we propose (1) a novel self-supervised method for pretraining span selection models, and (2) a question answering layer that aligns a representation of the question with the text.

Figure 2: An example paragraph before (a) and after (b) masking recurring spans. Each color represents a different cluster of spans. After masking recurring spans (replacing each with a single [QUESTION] token), only one span from each cluster remains unmasked, and is considered the correct answer to the masked spans in the cluster. The pretraining task is to predict the correct answer for each [QUESTION].
We introduce Splinter (span-level pointer), a pretrained model for few-shot question answering. The challenge in defining such a self-supervised task is how to create question-answer pairs from unlabeled data. Our key observation is that one can leverage recurring spans: n-grams, such as named entities, which tend to occur multiple times in a given passage (e.g., "Roosevelt" in Figure 2). We emulate question answering by masking all but one instance of each recurring span with a special [QUESTION] token, and asking the model to select the correct span for each such token.
To select an answer span for each [QUESTION] token in parallel, we introduce a question-aware span selection (QASS) layer, which uses the [QUESTION] token's representation to select the answer span. The QASS layer seamlessly integrates with fine-tuning on real question-answer pairs. We simply append the [QUESTION] token to the input question, and use the QASS layer to select the answer span ( Figure 3). This is unlike existing models for span selection, which do not include an explicit question representation. The compatibility between pretraining and fine-tuning makes Splinter an effective few-shot learner.
Splinter exhibits surprisingly high performance given only a few training examples throughout a variety of benchmarks from the MRQA 2019 shared task (Fisch et al., 2019). For example, Splinter-base achieves 72.7 F1 on SQuAD with only 128 examples, outperforming all baselines by a very wide margin. An ablation study shows that the pretraining method and the QASS layer itself (even without pretraining) both contribute to improved performance. Analysis indicates that Splinter's representations change significantly less during fine-tuning compared to the baselines, suggesting that our pretraining is more adequate for question answering. Overall, our results highlight the importance of designing objectives and architectures in the few-shot setting, where an appropriate inductive bias can lead to dramatic performance improvements.

Background
Extractive question answering is a common task in NLP, where the goal is to select a contiguous span a from a given text T that answers a question Q. This format was popularized by SQuAD (Rajpurkar et al., 2016), and has since been adopted by several datasets in various domains (Trischler et al., 2017; Kembhavi et al., 2017) and languages (Lewis et al., 2020; Clark et al., 2020), with some extensions allowing for unanswerable questions (Levy et al., 2017; Rajpurkar et al., 2018) or multiple answer spans (Dua et al., 2019; Dasigi et al., 2019). In this work, we follow the assumptions in the recent MRQA 2019 shared task (Fisch et al., 2019) and focus on questions whose answer is a single span.
The standard approach uses a pretrained encoder, such as BERT (Devlin et al., 2019), and adds two parameter vectors s, e to the pretrained model in order to detect the start and end positions of the answer span a, respectively. The input text T and question Q are concatenated and fed into the encoder, producing a contextualized token representation x_i for each token in the sequence.
To predict the start position of the answer span, a probability distribution is induced over the entire sequence by computing the inner product of a learned vector s with every token representation (the end position is computed similarly using a vector e):

$$P(\mathrm{start} = i \mid T, Q) = \frac{\exp(\mathbf{x}_i^\top \mathbf{s})}{\sum_j \exp(\mathbf{x}_j^\top \mathbf{s})}$$
The parameters s, e are trained during fine-tuning, using the cross-entropy loss with the start and end positions of the gold answer span.
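As a toy illustration of this scoring (our sketch, not the paper's code; the representations below are hand-made 2-dimensional vectors), the start distribution can be computed in plain Python:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    z = sum(exps)
    return [v / z for v in exps]

def span_boundary_distribution(token_reprs, boundary_vec):
    """P(i) proportional to exp(x_i . s): inner product of each token
    representation with a learned boundary vector, normalized by softmax."""
    scores = [sum(a * b for a, b in zip(x, boundary_vec)) for x in token_reprs]
    return softmax(scores)

# Toy example: three tokens with 2-dimensional representations.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [2.0, 0.0]
p_start = span_boundary_distribution(X, s)  # distribution over start positions
```

The end distribution is obtained identically with the learned vector e in place of s.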
This approach assumes that each token representation x_i is contextualized with respect to the question. However, the masked language modeling objective does not necessarily encourage this form of long-range contextualization in the pretrained model, since many of the masked tokens can be resolved from local cues. Fine-tuning the attention patterns of pretrained masked language models may thus entail an extensive learning effort, difficult to achieve with only a handful of training examples. We overcome this issue by (1) pretraining directly for span selection, and (2) explicitly representing the question with a single vector, used to detect the answer in the input text.

Splinter
We formulate a new task for pretraining question answering from unlabeled text: recurring span selection. We replace spans that appear multiple times in the given text with a special [QUESTION] token, except for one occurrence, which acts as the "answer" span for each (masked) cloze-style "question". The prediction layer is a modification of the standard span selection layer, which replaces the static start and end parameter vectors, s and e, with dynamically-computed boundary detectors based on the contextualized representation of each [QUESTION] token. We reuse this architecture when fine-tuning on question-answer pairs by adding a [QUESTION] token at the end of the actual question, thus aligning the pretraining objective with the fine-tuning task. We refer to our pretrained model as Splinter.

Pretraining: Recurring Span Selection
Given an input text T, we find all recurring spans: arbitrary n-grams that appear more than once in the same text. For each set of identical recurring spans R, we select a single occurrence as the answer a and replace all other occurrences with a single [QUESTION] token. The goal of recurring span selection is to predict the correct answer a for a given [QUESTION] token q ∈ R \ {a}, each q thus acting as an independent cloze-style question. Figure 2 illustrates this process. In the given passage, the span "Roosevelt" appears three times. Two of its instances (the second and third) are replaced with [QUESTION], while one instance (the first) becomes the answer, and remains intact. After masking, the sequence is passed through a transformer encoder, producing contextualized token representations. The model is then tasked with predicting the start and end positions of the answer given each [QUESTION] token representation. In Figure 2b, we observe four instances of this prediction task: two for the "Roosevelt" cluster, one for the "Allied countries" cluster, and one for "Declaration by United Nations".
Taking advantage of recurring words in a passage (restricted to nouns or named entities) was proposed in past work as a signal for coreference (Kocijan et al., 2019; Ye et al., 2020). We further discuss this connection in Section 7.
Span Filtering To focus pretraining on semantically meaningful spans, we use the following definition for "spans", which filters out recurring spans that are likely to be uninformative: (1) spans must begin and end at word boundaries, (2) we consider only maximal recurring spans, (3) spans containing only stop words are ignored, and (4) spans are limited to a maximum of 10 tokens. These simple heuristic filters do not require a model, as opposed to masking schemes in related work (Glass et al., 2020; Ye et al., 2020; Guu et al., 2020), which require part-of-speech taggers, constituency parsers, or named entity recognizers.
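As a minimal sketch (ours, not the authors' implementation), these filters can be applied over whitespace-tokenized text. The stop-word list below is an illustrative subset, and operating on word tokens satisfies filter (1) by construction:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "on", "by"}  # illustrative subset
MAX_SPAN_LEN = 10  # filter (4)

def recurring_spans(tokens):
    """Return {ngram: [start indices]} for maximal recurring spans."""
    occ = defaultdict(list)
    for n in range(1, MAX_SPAN_LEN + 1):
        for i in range(len(tokens) - n + 1):
            occ[tuple(tokens[i:i + n])].append(i)
    # Filter (3): drop spans that occur only once or consist of stop words.
    rec = {g: ps for g, ps in occ.items()
           if len(ps) > 1 and not all(t.lower() in STOP_WORDS for t in g)}
    # Filter (2): keep only maximal spans -- drop a span if every one of
    # its occurrences lies strictly inside a longer recurring occurrence.
    intervals = [(i, i + len(g)) for g, ps in rec.items() for i in ps]
    def contained(i, n):
        return any(a <= i and i + n <= b and b - a > n for a, b in intervals)
    return {g: ps for g, ps in rec.items()
            if not all(contained(i, len(g)) for i in ps)}
```

For example, on the tokens of "the big dog saw the big dog", only the maximal span ("the", "big", "dog") survives; its sub-spans are filtered out.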
Cluster Selection We mask a random subset of recurring span clusters in each text, leaving some recurring spans untouched. Specifically, we replace up to 30 spans with [QUESTION] from each input passage. This number was chosen to resemble the 15% token-masking ratio of Joshi et al. (2020). Note that in our case, the number of masked tokens is greater than the number of questions.
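A hedged sketch of this masking step, assuming clusters are given as a mapping from span n-grams to their start positions (names are ours; overlapping occurrences are ignored for simplicity):

```python
import random

MAX_QUESTIONS = 30  # at most 30 [QUESTION] tokens per passage

def mask_recurring_spans(tokens, clusters, seed=0):
    """Pick a random subset of clusters, keep one occurrence per cluster
    as the gold answer, and collapse every other occurrence into a single
    [QUESTION] token."""
    rng = random.Random(seed)
    items = list(clusters.items())
    rng.shuffle(items)
    replace, answers, n_questions = [], [], 0
    for span, starts in items:
        if n_questions + len(starts) - 1 > MAX_QUESTIONS:
            continue  # masking this cluster would exceed the budget
        keep = rng.choice(starts)          # the unmasked "answer" occurrence
        answers.append((keep, len(span)))
        replace.extend((s, len(span)) for s in starts if s != keep)
        n_questions += len(starts) - 1
    # Rebuild the sequence right-to-left so earlier indices stay valid.
    out = list(tokens)
    for s, n in sorted(replace, reverse=True):
        out[s:s + n] = ["[QUESTION]"]
    return out, answers
```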

Model: Question-Aware Span Selection
Our approach converts texts into a set of questions that need to be answered simultaneously. The standard approach for extractive question answering (Devlin et al., 2019) is inapplicable, because it uses fixed start and end vectors. Since we have multiple questions, we replace the standard parameter vectors s, e with dynamic start and end vectors s_q, e_q, computed from each [QUESTION] token q:

$$\mathbf{s}_q = \mathbf{S} \mathbf{x}_q \qquad \mathbf{e}_q = \mathbf{E} \mathbf{x}_q$$

Here, S, E are parameter matrices, which extract ad hoc start and end position detectors s_q, e_q from the given [QUESTION] token's representation x_q.
The rest of our model follows the standard span selection model by computing the start and end position probability distributions:

$$P(\mathrm{start} = i \mid q) = \frac{\exp(\mathbf{x}_i^\top \mathbf{s}_q)}{\sum_j \exp(\mathbf{x}_j^\top \mathbf{s}_q)} \qquad P(\mathrm{end} = i \mid q) = \frac{\exp(\mathbf{x}_i^\top \mathbf{e}_q)}{\sum_j \exp(\mathbf{x}_j^\top \mathbf{e}_q)}$$

The model can also be viewed as two bilinear functions of the question representation x_q with each token representation x_i, similar to Dozat and Manning (2017), since $\mathbf{x}_i^\top \mathbf{s}_q = \mathbf{x}_i^\top \mathbf{S} \mathbf{x}_q$ (and analogously for the end position). Finally, we use the answer's gold start and end points (s_a, e_a) to compute the cross-entropy loss:

$$\mathcal{L} = -\log P(\mathrm{start} = s_a \mid q) - \log P(\mathrm{end} = e_a \mid q)$$

We refer to this architecture as the question-aware span selection (QASS) layer.
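The QASS computation can be sketched in pure Python (a real implementation would use batched matrix operations in a deep-learning framework; all names here are illustrative):

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def qass_distributions(token_reprs, q_repr, S, E):
    """Dynamic boundary detectors s_q = S x_q and e_q = E x_q, scored
    bilinearly against every token: P(start = i) ~ exp(x_i . S x_q)."""
    s_q, e_q = matvec(S, q_repr), matvec(E, q_repr)
    start = softmax([sum(a * b for a, b in zip(x, s_q)) for x in token_reprs])
    end = softmax([sum(a * b for a, b in zip(x, e_q)) for x in token_reprs])
    return start, end

def qass_loss(start, end, s_a, e_a):
    """Cross-entropy on the gold start (s_a) and end (e_a) positions."""
    return -math.log(start[s_a]) - math.log(end[e_a])
```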

Fine-Tuning
After pretraining, we assume access to labeled examples, where each training instance is a text T , a question Q, and an answer a that is a span in T . To make this setting similar to pretraining, we simply append a [QUESTION] token to the input sequence, immediately after the question Q (see Figure 3). Selecting the answer span then proceeds exactly as during pretraining. Indeed, the advantage of our approach is that in both pretraining and fine-tuning, the [QUESTION] token representation captures information about the question that is then used to select the span from context.
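A minimal sketch of the fine-tuning input construction; the exact placement of the [CLS]/[SEP] special tokens is our assumption, not specified above:

```python
def build_finetune_input(question_tokens, context_tokens):
    """Append a [QUESTION] token right after the question, mirroring
    pretraining; the [CLS]/[SEP] layout is an assumption of this sketch."""
    return (["[CLS]"] + question_tokens + ["[QUESTION]", "[SEP]"]
            + context_tokens + ["[SEP]"])

seq = build_finetune_input(["who", "won"], ["france", "won", "it"])
```

The [QUESTION] token's contextualized representation is then fed to the QASS layer exactly as during pretraining.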

A Few-Shot QA Benchmark
To evaluate how pretrained models work when only a small amount of labeled data is available for fine-tuning, we simulate various low-data scenarios by sampling subsets of training examples from larger datasets. We use a subset of the MRQA 2019 shared task (Fisch et al., 2019), which contains extractive question answering datasets in a unified format, where the answer is a single span in the given text passage.

Split I of the MRQA shared task contains 6 large question answering datasets: SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), TriviaQA (Joshi et al., 2017), SearchQA (Dunn et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (Kwiatkowski et al., 2019). For each dataset, we sample smaller training datasets from the original training set with sizes changing on a logarithmic scale, from 16 to 1,024 examples. To reduce variance, for each training set size, we sample 5 training sets using different random seeds and report average performance across training sets. We also experiment with fine-tuning the models on the full training sets. Since Split I of the MRQA shared task does not contain test sets, we evaluate using the official development sets as our test sets.
We also select two datasets from Split II of the MRQA shared task that were annotated by domain experts: BioASQ (Tsatsaronis et al., 2015) and TextbookQA (Kembhavi et al., 2017). Each of these datasets only has a development set that is publicly available in MRQA, containing about 1,500 examples. For each dataset, we sample 400 examples for evaluation (test set), and follow the same protocol we used for large datasets to sample training sets of 16 to 1,024 examples from the remaining data.
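The sampling protocol can be sketched as follows; the powers-of-two sizes are our reading of "a logarithmic scale, from 16 to 1,024 examples", and fixing the seed makes each subset reproducible:

```python
import random

SIZES = [16, 32, 64, 128, 256, 512, 1024]  # logarithmic scale (assumed powers of two)
SEEDS = range(5)                           # 5 training sets per size

def sample_training_sets(examples):
    """Draw a fixed, reproducible subset of the training set for every
    (size, seed) combination; performance is averaged over the 5 seeds."""
    subsets = {}
    for size in SIZES:
        for seed in SEEDS:
            rng = random.Random(seed)
            subsets[(size, seed)] = rng.sample(examples, size)
    return subsets
```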
To maintain the few-shot setting, every dataset in our benchmark has well-defined training and test sets. To tune hyperparameters, one needs to extract validation data from each training set. For simplicity, we do not perform hyperparameter tuning or model selection (see Section 5), and thus use all of the available few-shot data for training.

Experimental Setup
We describe our experimental setup in detail, including all models and baselines.
In all experiments, we compare Splinter-base to three baselines of the same capacity:

RoBERTa (Liu et al., 2019) A highly-tuned and optimized version of BERT, which is known to perform well on a wide range of natural language understanding tasks.
SpanBERT (Joshi et al., 2020) A BERT-style model that focuses on span representations. SpanBERT is trained by masking contiguous spans of tokens and optimizing two objectives: (a) masked language modeling, which predicts each masked token from its own vector representation; (b) the span boundary objective, which predicts each masked token from the representations of the unmasked tokens at the start and end of the masked span.
SpanBERT (Reimpl) Our reimplementation of SpanBERT, using exactly the same code, data, and hyperparameters as Splinter. This baseline aims to control for implementation differences and measures the effect of replacing masked language modeling with recurring span selection. Also, this version does not use the span boundary objective, as Joshi et al. (2020) reported no significant improvements from using it in question answering.

Pretraining Implementation
We train Splinter-base using Adam (Kingma and Ba, 2015) for 2.4M training steps with batches of 256 sequences of length 512. The learning rate is warmed up for 10k steps to a maximum value of 10^-4, after which it decays linearly. As in previous work, we use a dropout rate of 0.1 across all layers.
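The learning-rate schedule can be sketched as follows (that the linear decay ends exactly at zero on the final step is our assumption):

```python
MAX_LR = 1e-4       # peak learning rate
WARMUP = 10_000     # warmup steps
TOTAL = 2_400_000   # total pretraining steps

def learning_rate(step):
    """Linear warmup to MAX_LR over the first 10k steps, then linear
    decay over the remaining steps."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP
    return MAX_LR * (TOTAL - step) / (TOTAL - WARMUP)
```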
We follow Devlin et al. (2019) and train on English Wikipedia (preprocessed by WikiExtractor as in Attardi (2015)) and the Toronto BookCorpus (Zhu et al., 2015). We base our implementation on the official TensorFlow implementation of BERT, and train on a single eight-core v3 TPU (v3-8) on the Google Cloud Platform.

Fine-Tuning Implementation
For fine-tuning, we use the hyperparameters from the default configuration of the HuggingFace Transformers package (Wolf et al., 2020). Specifically, we train all models using Adam (Kingma and Ba, 2015) with bias-corrected moment estimates for few-shot learning (Zhang et al., 2021). When fine-tuning on 1024 examples or fewer, we train for either 10 epochs or 200 steps (whichever is larger). For full-size datasets, we train for 2 epochs. We set the batch size to 12 and use a maximal learning rate of 3·10^-5, which warms up in the first 10% of the steps, and then decays linearly.
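A sketch of the resulting step counts and learning-rate schedule (helper names are ours; linear decay to exactly zero is an assumption):

```python
import math

BATCH_SIZE = 12

def num_finetune_steps(n_examples):
    """10 epochs or 200 steps (whichever is larger) for few-shot sets of
    up to 1024 examples; 2 epochs for full-size datasets."""
    steps_per_epoch = math.ceil(n_examples / BATCH_SIZE)
    if n_examples <= 1024:
        return max(10 * steps_per_epoch, 200)
    return 2 * steps_per_epoch

def finetune_lr(step, total_steps, max_lr=3e-5):
    """Warm up over the first 10% of steps, then decay linearly."""
    warmup = max(1, int(0.1 * total_steps))
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * (total_steps - step) / (total_steps - warmup)
```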
An interesting question is how to fine-tune the QASS layer parameters (i.e., the S and E matrices in Section 3.2). In our implementation, we chose to discard the pretrained values and fine-tune from a random initialization, due to the possible discrepancy between span statistics in pretraining and fine-tuning datasets. However, we report results on fine-tuning without resetting the QASS parameters as an ablation study (Section 6.3).

Results
Our experiments show that Splinter dramatically improves performance in the challenging few-shot setting, unlocking the ability to train question answering models with only hundreds of examples. When trained on large datasets with on the order of 100,000 examples, Splinter is competitive with (and often better than) the baselines. Ablation studies demonstrate the contributions of both recurring span selection pretraining and the QASS layer.

Few-Shot Learning

Table 1 shows the performance of individual models when given 16, 128, and 1024 training examples across all datasets (see Table 3 in the appendix for additional performance and standard deviation statistics). It is evident that Splinter outperforms all baselines by large margins. Let us examine the results on SQuAD, for example. Given 16 training examples, Splinter obtains 54.6 F1, significantly higher than the best baseline's 18.2 F1. When the number of training examples is 128, Splinter achieves 72.7 F1, outperforming the baselines by 17 points (over our reimplementation of SpanBERT) to 30 points (over RoBERTa). With 1024 examples, there is a 5-point margin between Splinter (82.8 F1) and SpanBERT (77.8 F1). The same trend holds for the other datasets, whether they are sampled in-domain from larger datasets (e.g., TriviaQA) or not; in TextbookQA, for instance, we observe absolute gaps of 9 to 23 F1 between Splinter and the next-best baseline.

High-Resource Regime
Table 1 also shows the performance when fine-tuning on the entire training set, when on the order of 100,000 examples are available. Even though Splinter was designed for few-shot question answering, it reaches the best result in five out of six datasets. This result suggests that when the target task is extractive question answering, it is better to pretrain with our recurring span selection task than with masked language modeling, regardless of the number of annotated training examples.

Ablation Study
We perform an ablation study to better understand the independent contributions of the pretraining scheme and the QASS layer. We first ablate the effect of pretraining on recurring span selection by applying the QASS layer to pretrained masked language models. We then test whether the QASS layer's pretrained parameters can be reused in Splinter during fine-tuning without reinitialization.

Independent Contribution of the QASS Layer
While the QASS layer is motivated by our pretraining scheme, it can also be used without pretraining. We apply a randomly-initialized QASS layer to our implementation of SpanBERT, and fine-tune it in the few-shot setting. Figure 5 shows the results of this ablation study for two datasets (see Figure 7 in the appendix for more datasets). We observe  that replacing the static span selection layer with QASS can significantly improve performance on few-shot question answering. Having said that, most of Splinter's improvements in the extremely low data regime do stem from combining the QASS layer with our pretraining scheme, and this combination still outperforms all other variants as the amount of data grows.

QASS Reinitialization
Between pretraining and fine-tuning, we randomly reinitialize the parameters of the QASS layer. We now test the effect of fine-tuning with the QASS layer's pretrained parameters; intuitively, the more similar the pretraining data is to the task, the better the pretrained layer will perform. Figure 5 shows that the advantage of reusing the pretrained QASS layer is data-dependent: it can yield performance gains (e.g., in the extremely low-data regime on SQuAD), but it can also stagnate. These results reflect the compatibility between the pretraining and fine-tuning tasks; the information learned in the QASS layer is useful as long as the input and output distributions of the task are close to those seen at pretraining time.

Analysis
The recurring span selection objective was designed to emulate extractive question answering using unlabeled text. How similar is it to the actual target task? To answer this question, we measure how much each pretrained model's functionality has changed after fine-tuning on 128 examples of SQuAD. For the purpose of this analysis, we measure change in functionality by examining the vector representation of each token as produced by the transformer encoder; specifically, we measure the cosine similarity between the vector produced by the pretrained model and the one produced by the fine-tuned model, given exactly the same input. We average these similarities across every token of 200 examples from SQuAD's test set.

Figure 5: Ablation studies on the SQuAD and BioASQ datasets. We examine the role of the QASS layer by fine-tuning it on top of our reimplementation of SpanBERT. In addition, we test whether it is beneficial to keep the pretrained parameters of the QASS layer when fine-tuning Splinter.

Table 2 shows that Splinter's outputs are very similar before and after fine-tuning (0.89 average cosine similarity), while the other models' representations seem to change drastically. This suggests that fine-tuning with even 128 question-answering examples makes significant modifications to the functionality of pretrained masked language models. Splinter's pretraining, on the other hand, is much more similar to the fine-tuning task, resulting in much more modest changes to the produced vector representations.
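The similarity analysis reduces to averaging per-token cosine similarities, which can be sketched as:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def avg_representation_similarity(pre_reprs, post_reprs):
    """Average per-token cosine similarity between the representations a
    model produces before and after fine-tuning, over all tokens of all
    examples (each argument is nested: examples -> tokens -> vectors)."""
    sims = [cosine(u, v)
            for pre, post in zip(pre_reprs, post_reprs)
            for u, v in zip(pre, post)]
    return sum(sims) / len(sims)
```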

Related Work
The remarkable results of GPT-3 (Brown et al., 2020) have inspired a renewed interest in few-shot learning. While some work focuses on classification tasks (Schick and Schütze, 2020;Gao et al., 2021), our work investigates few-shot learning in the context of extractive question answering.
One approach to this problem is to create synthetic text-question-answer examples. Both Lewis et al. (2019) and Glass et al. (2020) use the traditional NLP pipeline to select noun phrases and named entities in Wikipedia paragraphs as potential answers, which are then masked from the context to create pseudo-questions. Lewis et al. (2019) use methods from unsupervised machine translation to translate the pseudo-questions into real ones, while Glass et al. (2020) keep the pseudo-questions but use information retrieval to find new text passages that can answer them. Both works assume access to language- and domain-specific NLP tools such as part-of-speech taggers, syntactic parsers, and named-entity recognizers, which might not always be available. Our work deviates from this approach by exploiting the natural phenomenon of recurring spans in order to generate multiple question-answer pairs per text passage, without assuming any language- or domain-specific models or resources are available beyond plain text.
Similar ideas to recurring span selection were used for creating synthetic coreference resolution examples (Kocijan et al., 2019; Varkel and Globerson, 2020), which mask single words that occur multiple times in the same context. CorefBERT (Ye et al., 2020) combines this approach with a copy mechanism for predicting the masked word during pretraining, alongside the masked language modeling objective. Unlike our approach, which was designed to align well with span selection, CorefBERT masks only single-word nouns (rather than arbitrary spans) and replaces each token in the word with a separate mask token (rather than a single mask for the entire multi-token word). Therefore, it does not emulate extractive question answering. We did not add CorefBERT as a baseline since the performance of both CorefBERT-base and CorefBERT-large was lower than SpanBERT-base's performance on the full-data MRQA benchmark, and pretraining CorefBERT from scratch was beyond our available computational resources.

Conclusion
We explore the few-shot setting of extractive question answering, and demonstrate that existing methods, based on fine-tuning large pretrained language models, fail in this setup. We propose a new pretraining scheme and architecture for span selection that lead to dramatic improvements, reaching surprisingly good results even when only on the order of a hundred examples are available. Our work shows that choices often deemed unimportant when enough data is available become crucial in the few-shot setting, opening the door to new methods that take advantage of prior knowledge about the downstream task during model development.