FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models

The task of learning from only a few examples (called a few-shot setting) is of key importance and relevance to a real-world setting. For question answering (QA), the current state-of-the-art pre-trained models typically need fine-tuning on tens of thousands of examples to obtain good results. Their performance degrades significantly in a few-shot setting (< 100 examples). To address this, we propose a simple fine-tuning framework that leverages pre-trained text-to-text models and is directly aligned with their pre-training framework. Specifically, we construct the input as a concatenation of the question, a mask token representing the answer span and a context. Given this input, the model is fine-tuned using the same objective as its pre-training objective. Through experimental studies on various few-shot configurations, we show that this formulation leads to significant gains on multiple QA benchmarks (an absolute gain of 34.2 F1 points on average when there are only 16 training examples). The gains extend further when used with larger models (e.g., 72.3 F1 on SQuAD using BART-large with only 32 examples) and translate well to a multilingual setting. On the multilingual TydiQA benchmark, our model outperforms XLM-RoBERTa-large by an absolute margin of up to 40 F1 points and an average of 33 F1 points in a few-shot setting (<= 64 training examples). We conduct detailed ablation studies to analyze the factors contributing to these gains.


Introduction
The task of question answering (QA) in Natural Language Processing typically involves producing an answer for a given question using a context that contains evidence to support the answer. The latest advances in pre-trained language models have resulted in performance close to (and sometimes exceeding) human performance when fine-tuned on several QA benchmarks (Brown et al., 2020; Bao et al., 2020; Raffel et al., 2020). However, to achieve this result, these models need to be fine-tuned on tens of thousands of examples. In a more realistic and practical scenario, where only a handful of annotated training examples are available, their performance degrades significantly. For instance, (Ram et al., 2021) show that, when only 16 training examples are available, RoBERTa-base (Liu et al., 2019) and SpanBERT-base (Joshi et al., 2020) obtain F1 scores of 7.7 and 18.2, respectively, on SQuAD (Rajpurkar et al., 2016). This is far lower than their scores of 90.3 and 92.0 when using the full training set of >100,000 examples. Through experimental analysis, we observe that this degradation is largely attributable to the disparities between the fine-tuning and pre-training frameworks (a combination of the input-output design and the training objective). To address this, we propose a fine-tuning framework (referred to as FewshotQA hereafter) that is directly aligned with the pre-training framework, in terms of both the input-output design and the training objective. Specifically, we construct the input as a concatenation of the question, a mask token and the context (in that order) and fine-tune a text-to-text pre-trained model using the same objective used during its pre-training to recover the answer. These text-to-text pre-trained models were originally trained to recover missing spans of text in a given input sequence.
Since our proposed fine-tuning setup is nearly identical to the pre-training setup, the model can make the best use of its pre-training "knowledge" for the fine-tuning task of question answering.
The effectiveness of our FewshotQA system is shown in its strong results (an absolute average gain of 34.2 F1 points) on multiple QA benchmarks in a few-shot setting. We show that the gains extend further when used with larger models. We also test FewshotQA on a multilingual benchmark by replacing the pre-trained model with its multilingual counterpart.

[Figure 1: (a) Pre-training setups for BERT*, BART and T5; (b) their standard QA fine-tuning setups; (c) our proposed FewshotQA fine-tuning framework. Note that the only difference between (c) and the pre-training setup above lies in the inputs and outputs.]

Few-shot fine-tuning framework design

Our proposed few-shot fine-tuning framework involves a different choice of input-output design and training objective than the current standard for QA fine-tuning frameworks. We motivate this design by comparison with the existing frameworks; Figure 1 illustrates this in detail, with the pre-training framework pictured for comparison. Note that we focus on bi-directional masked language models (MLMs) rather than auto-regressive language models (such as GPT-2 (Radford et al., 2019)), as MLMs are typically deemed superior for QA tasks (Lewis et al., 2020). Figure 1a compares the pre-training setups of three types of models. First, BERT-style encoder-only models (referred to as BERT*) are pre-trained with the standard masked language modeling objective (also called a denoising objective) of predicting the masked tokens in an input sequence I. The masked tokens here typically correspond to a single word or sub-word. Then, BART (Lewis et al., 2020) uses a corrupted-input reconstruction objective to recover the original input.

The corruption involves replacing a span of multiple tokens with a single mask token, along with sentence shuffling. Finally, T5 (Raffel et al., 2020) uses a masked-span generation objective to predict the masked spans in an input. The input here is similar to that of BART, where multiple spans are replaced with mask tokens; however, instead of generating the full input, only the masked spans are generated. Figure 1b illustrates the fine-tuning setups of these models for the task of question answering. The input to both the BERT-style encoder-only models and BART is a concatenation of the question and the context, and both use a similar objective that encourages the model to predict the correct start and end positions of the answer in a given input. This is referred to as a span-selection objective. The input to T5 is also the concatenation of the question and the context; however, T5 uses an answer span generation objective that lets the model directly generate the answer from scratch.

Aligning the fine-tuning with pre-training
The intuition behind aligning the fine-tuning and pre-training frameworks is that the model can make the best use of the "knowledge" obtained during the pre-training phase. For question answering, the fine-tuning task involves predicting an answer span that may contain multiple tokens. This makes it non-trivial to align BERT* models for the QA task during fine-tuning, as their pre-training objectives let the model predict only a single word (or sub-word) per mask token. Similarly, having SpanBERT predict multiple masked tokens would require knowing the answer length in advance. Given that their pre-training objectives naturally involve multi-token span generation, BART and T5 make good candidates for this alignment. We further enhance the alignment by constructing the inputs (and outputs) to resemble what the model sees during pre-training. This is done by appending a mask token (corresponding to the answer in the target) as part of the input. This framework is illustrated in Figure 1c. We test the effectiveness of our formulation in various few-shot scenarios and observe significant gains.
Overall, we establish that the combination of text-to-text models and fine-tuning framework that is aligned with its pre-training counterpart makes a strong few-shot QA system. We now describe our experimental setup in Section 3.

Modeling details

Architecture
Our model follows the standard pre-trained text-to-text model architecture: a Transformer-based encoder and decoder. We default to the "base" versions of the BART and T5 models as they contain a modest number of parameters (140M and 220M, respectively) compared to the larger variants. For T5, we use T5-V1.1, as the publicly released T5-V1.0 is fine-tuned on downstream tasks including question answering, thereby contaminating it for our experimentation. BART-base consists of 6 encoder and 6 decoder layers with a hidden dimension of 768; T5-base consists of 12 encoder and 12 decoder layers with a hidden dimension of 768 and a feed-forward hidden dimension of 3072. We refer to the BART and T5 variants used with our fine-tuning framework as FewshotBART and FewshotT5, respectively.

Input-output design & fine-tuning objective
The input (x_M) to the model consists of a concatenation of three text sequences. The first (x_q) is the set of question tokens (q) prefixed with the phrase "Question:", the second (x_a) is a mask token (m) prefixed with the phrase "Answer:", and the third (x_c) is the set of context tokens (c) prefixed with the phrase "Context:".
x_q = Question: q
x_a = Answer: m
x_c = Context: c

The target for the BART model (y_BART) is a concatenation of two text sequences, x_q and y_a, where y_a is the set of answer tokens prefixed with the phrase "Answer:":

y_a = Answer: a

The target for our T5 variant (y_T5) is a concatenation of a mask token m and y_a. An example of the input-target pairs from a dataset is shown in Figure 2.
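
To make the construction concrete, the input and the two targets can be sketched as plain string templates. This is a minimal illustration: the helper `build_example` and the example QA triple are ours, and the literal mask string is model-specific (a BART-style "<mask>" is assumed here).

```python
# Sketch of the FewshotQA input-target construction.
# The mask token string is an assumption; each tokenizer defines its own.

def build_example(question: str, answer: str, context: str,
                  mask_token: str = "<mask>"):
    """Return (model input x_M, BART target, T5 target) for one QA example."""
    x_q = f"Question: {question}"
    x_a = f"Answer: {mask_token}"
    x_c = f"Context: {context}"
    x_m = f"{x_q} {x_a} {x_c}"       # question, mask, context (in that order)
    y_a = f"Answer: {answer}"
    y_bart = f"{x_q} {y_a}"          # BART reconstructs question + answer
    y_t5 = f"{mask_token} {y_a}"     # T5 emits the masked span portion only
    return x_m, y_bart, y_t5

x_m, y_bart, y_t5 = build_example(
    "Who wrote Hamlet?", "Shakespeare",
    "Hamlet was written by Shakespeare.")
```

The strings produced by this sketch follow the pattern shown in Figure 2; in a real run they would then be tokenized for the respective model.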
The choice of text-to-text models in our system allows us to use the standard encoder-decoder objective that maximizes the log-likelihood of the ground-truth target text under the model's output distribution. Formally, given the input x_M and the target y (one of y_BART and y_T5), the loss function L is:

L(θ) = − Σ_{(x_M, y) ∈ (X_M, Y)} Σ_{i=1}^{n} log P(y_i | y_{<i}, x_M; θ)

Here, X_M is the set of inputs, Y is the set of targets, and n is the number of tokens in the target sequence. y_i is the target token at timestep i, and y_{<i} represents all the target tokens preceding timestep i. P(y_i | y_{<i}, x_M; θ) is the probability of generating token y_i given all the preceding ground-truth tokens and the input, and θ represents the parameters of the model.
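
As a toy numerical illustration of this objective: the loss for one example is the negative sum of the log-probabilities the model assigns to each ground-truth target token, conditioned on the preceding target tokens and the input. The per-token probabilities below are made-up numbers, not real model outputs.

```python
import math

def seq2seq_nll(token_probs):
    """Negative log-likelihood of one target sequence, given the
    probabilities P(y_i | y_<i, x_M) the model assigns to each token."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities for a 3-token target such as
# "Answer: Shakespeare </s>" (teacher forcing on the ground truth).
probs = [0.9, 0.8, 0.95]
loss = seq2seq_nll(probs)
```

Minimizing this quantity over the training set is exactly the maximum-likelihood objective above; encoder-decoder libraries typically compute it as a token-level cross-entropy.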
We chose the order of concatenation in the input as the question, followed by a mask token, followed by the context, as it enables us to run the generation process for far fewer steps before an answer is produced (see the Generation strategy section below). The context can be quite long, so generating the entire context auto-regressively before generating an answer would be inefficient and degrade performance.

Generation strategy
During both validation and testing, the model is provided the special start token as input and asked to generate tokens auto-regressively for a fixed number of steps. For BART, since the question and answer tokens are at the beginning of the sequence in the input and the model is trained to reconstruct the input, we only need to generate until the answer appears. In practice, for the datasets we experiment with, a generation length of 50 is sufficient to produce the answer; we stop the generation once these 50 tokens are generated.
For T5, the generation length is set to 25, as only the answer is generated. For both models, we use greedy decoding with a beam size of 1.
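
The decoding loop itself can be sketched as follows. This is a minimal stand-in, not the real decoder: `next_token` is a hypothetical function representing one decoder step, and the canned token sequence is ours.

```python
# Sketch of greedy (beam size 1) decoding for a fixed number of steps,
# as used at validation/test time.

def greedy_generate(next_token, start_token="<s>", max_steps=50,
                    eos_token="</s>"):
    """Repeatedly pick the single most likely next token (via `next_token`)
    until EOS is produced or the step budget (50 for BART, 25 for T5)
    is exhausted."""
    tokens = [start_token]
    for _ in range(max_steps):
        tok = next_token(tokens)
        if tok == eos_token:
            break
        tokens.append(tok)
    return tokens[1:]  # drop the start token

# Toy "decoder" that emits a fixed sequence and then EOS.
canned = iter(["Question:", "Who", "Answer:", "Shakespeare", "</s>"])
out = greedy_generate(lambda prefix: next(canned))
```

In practice this would be a single call to the model's generation routine with a maximum length of 50 (BART) or 25 (T5) and beam size 1.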

Answer extraction
Once the outputs are generated, the answers are extracted via a simple post-processing rule that extracts the answer part of the generation. The use of a fixed input pattern (Question: q Answer: a Context: c) makes this step deterministic, without the need for additional heuristics.
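
Because the generation follows the fixed pattern, the rule reduces to taking whatever lies between the "Answer:" marker and the next marker (or the end of the generation). A minimal sketch, with the helper name `extract_answer` being ours:

```python
# Deterministic answer extraction from the generated text.
# Relies only on the fixed "Question: ... Answer: ... Context: ..." pattern.

def extract_answer(generated: str) -> str:
    _, _, tail = generated.partition("Answer:")   # everything after "Answer:"
    answer, _, _ = tail.partition("Context:")     # stop at "Context:" if present
    return answer.strip()

pred = extract_answer("Question: Who wrote Hamlet? Answer: Shakespeare")
```

The same rule works whether or not the generation ran long enough to start reproducing the context.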

Multilingual extension
Our fine-tuning framework can be extended to a multilingual question answering setting by swapping the pre-trained model for its multilingual counterpart. We experiment by replacing the BART-base model with the mBART-50 model, which was pre-trained with the same objective as BART on a multilingual corpus. The remaining components, the fine-tuning objective and the answer extraction, remain the same as in FewshotBART. We call this model FewshotmBART.

Hyperparameters
We use the Adam optimizer with a learning rate of 2e-5 and a training batch size of 4, without learning rate scheduling. For evaluation on the test set, we pick the best model based on development set performance. The maximum sequence length is set to the 99th percentile of all sequence lengths in the development set. We train for a total of 35 epochs or 1000 steps (whichever is larger).
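
The maximum-sequence-length rule can be made explicit as below. This is an illustrative sketch: the helper `percentile_length` and the length values are ours, and a real run would measure tokenized lengths over the actual development set.

```python
# Sketch: set the maximum sequence length to the 99th percentile of
# tokenized sequence lengths observed in the development set.

def percentile_length(lengths, q=0.99):
    """Return the q-th percentile length (nearest-rank style)."""
    ordered = sorted(lengths)
    idx = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[idx]

# Hypothetical tokenized lengths of ten dev examples.
dev_lengths = [120, 250, 310, 190, 480, 200, 150, 260, 330, 175]
max_seq_len = percentile_length(dev_lengths)
```

Capping at a high percentile rather than the maximum keeps a rare outlier context from inflating the padded length of every batch.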

Datasets
We follow (Ram et al., 2021) and choose the datasets sampled from the MRQA shared task (Fisch et al., 2019) for our few-shot experiments. We also use the same train and test splits provided in (Ram et al., 2021) for fine-tuning and evaluating our models. However, instead of fine-tuning for a fixed number of iterations, we use a development set to determine the best checkpoint to use for testing. These datasets contain 5000 to 17000 test examples.
Development data split: To cater to a realistic and practical few-shot scenario, we pick the development set to be the same size as the training set. As reported in (Gao et al., 2021), having access to the full development set during training would create an unrealistic few-shot setting. We also make sure there is no overlap between the training and development sets. Below, we describe results for several experiments conducted on the MRQA few-shot datasets. We run each experiment five times using five different random seeds, and report the mean and standard deviation of the results.
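
The split construction described above can be sketched as follows. The helper `make_splits` and the index pool are ours; a real run would sample actual annotated examples from each MRQA dataset.

```python
import random

# Sketch of the few-shot split protocol: for each seed, draw a training set
# and an equally sized, disjoint development set.

def make_splits(pool, k, seed):
    """Return (train, dev), each of size k, sampled without overlap."""
    rng = random.Random(seed)
    sampled = rng.sample(pool, 2 * k)   # 2k distinct examples
    return sampled[:k], sampled[k:]

pool = list(range(1000))                # stand-in for example indices
for seed in range(5):                   # five random seeds per experiment
    train, dev = make_splits(pool, 16, seed)
    assert not set(train) & set(dev)    # train/dev are disjoint
```

Reporting mean and standard deviation over the five seeds then follows directly from repeating fine-tuning once per split.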

Comparing the standard vs aligned fine-tuning framework
First, we present results comparing the standard QA span-selection fine-tuning framework (BART) and our proposed fine-tuning framework (FewshotBART), which uses an input-output design and objective aligned with the pre-training framework. We choose BART and two training set sizes (16 and 128 examples) to illustrate this, and present elaborate results across all configurations in a later section. Both BART and FewshotBART use the base version, which contains 140M parameters. As seen in Table 1, our proposed fine-tuning framework improves the F1 score significantly across all datasets in both the 16-example scenario (an absolute gain of up to 48 F1 points and an average of 34.2 F1 points) and the 128-example scenario (an absolute gain of up to 39 F1 points and an average of 30.8 F1 points).

Few-shot results
Next, we present detailed experimental results (Table 2) obtained with our FewshotBART, FewshotBARTL and FewshotT5 models on several few-shot configurations across multiple datasets. For FewshotBARTL, the base model in FewshotBART is replaced with the larger 406M-parameter model. We compare our models with the RoBERTa and SpanBERT baselines and the recently proposed Splinter (Ram et al., 2021). RoBERTa and SpanBERT are fine-tuned with the span-selection objective, and we use the results for these models from (Ram et al., 2021).
We can see that FewshotBART, FewshotBARTL and FewshotT5 outperform the baselines by a large margin on almost all datasets. A few highlights are listed below:
• (a) Our best large model (FewshotBARTL) outperforms all other models by a large margin. Specifically, in a 16-example setting, it provides gains of up to 61.2 F1 points in comparison to a similarly sized RoBERTa model fine-tuned with a span-selection objective.
• (b) Our model most comparable to Splinter in size, FewshotBART, outperforms it by up to 31.6 F1 points in a 16-example setting and up to 10.9 F1 points in a 128-example setting. The TextbookQA dataset is one exception, where Splinter is stronger.
• (c) FewshotBART is stronger than FewshotT5 in a 16-example setting. This difference starts fading in the 32-, 64- and 128-example settings. However, FewshotT5 still performs better than Splinter on most datasets in the 16-, 32- and 64-example settings.

Choice of input-outputs and fine-tuning objectives
In this section, we investigate the impact of changing the input-output design and the fine-tuning objective on model performance (see Figure 3). Given a question q, answer a and context c, we evaluate the following input-output choices and objectives:

Span-selection: The standard extractive question answering objective, where the model predicts the begin and end tokens corresponding to the answer in a given input I: Question: q [S] Context: c

Full input generation: The model predicts the entire input, including the masked answer span.
Input: Question: q Answer: <mask>. Context: c
Target: Question: q Answer: a. Context: c

Question->Answer generation: The model generates only the question and the (masked) answer part of the input, with the question tokens followed by the answer tokens; the context tokens are not included in the fine-tuning objective.
Input: Question: q Answer: <mask>. Context: c
Target: Question: q Answer: a.

Answer->Question generation: Similar to the Question->Answer generation objective, except the answer tokens are followed by the question tokens.
Input: Question: q Answer: <mask>. Context: c
Target: Answer: a. Question: q

Answer generation: The generation-based equivalent of the standard span-selection objective: the model generates the answer given an input question and context.
Input: Question: q Context: c
Target: a

The results are plotted in Figure 3. We choose the SQuAD and NewsQA datasets for illustration. There are several key findings in the context of a few-shot QA setting (up to 128 examples):
• (a) The fine-tuning objectives that are aligned with the pre-training objective (red, green, violet lines) show large gains over the standard span-selection fine-tuning objective (blue line). The answer generation objective (orange line) is superior to the span-selection objective (blue line) on both datasets.
• (b) The sequencing of question and answer tokens in the input-output has an impact on performance, with the specific sequencing of question followed by answer being superior.
• (c) The Question->Answer generation and Full input generation objectives show strong performance even with only 16 examples. The gap between span-selection and the other objectives continues to be large even with 128 training examples.
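
The target strings for the generation-based variants in this ablation can be sketched as a single construction rule. The helper `make_target` and the variant names are ours; the Answer->Question target follows the ordering that variant's name implies (answer tokens before question tokens).

```python
# Sketch of target construction for each generation-based objective variant.
# The input is "Question: q Answer: <mask>. Context: c" for all variants
# except answer_only, whose input is "Question: q Context: c".

def make_target(question, answer, context, variant):
    q = f"Question: {question}"
    a = f"Answer: {answer}"
    c = f"Context: {context}"
    if variant == "full_input":        # reconstruct the whole input
        return f"{q} {a}. {c}"
    if variant == "question_answer":   # question then answer, no context
        return f"{q} {a}."
    if variant == "answer_question":   # answer then question, no context
        return f"{a}. {q}"
    if variant == "answer_only":       # just the answer string
        return answer
    raise ValueError(f"unknown variant: {variant}")

t = make_target("Who wrote Hamlet?", "Shakespeare", "...", "question_answer")
```

Everything else in the fine-tuning loop (model, optimizer, decoding) stays fixed across variants, which is what isolates the effect of the target design.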

Multilingual results
Here, we describe the results of applying the FewshotmBART model described earlier to the multilingual corpus TydiQA (Clark et al., 2020a). TydiQA consists of question answering datasets from 9 languages (Arabic, Bengali, English, Finnish, Indonesian, Korean, Russian, Swahili, Telugu). We compare this to the results of applying XLM-RoBERTa-large (Conneau et al., 2020) to the same dataset, fine-tuned using the standard span-selection objective from extractive question answering. The results are shown in Figure 4. FewshotmBART outperforms XLM-RoBERTa-large by an average of 32.96 absolute F1 points for training data sizes spanning from 2 to 64 examples.

Related Work
Question Answering (QA) is an active area of research in Natural Language Processing, and the recent advances in pre-trained language models have enabled rapid progress in the field (Brown et al., 2020; Bao et al., 2020; Raffel et al., 2020). QA is also used as a format to cast several NLP problems (McCann et al., 2018), (Chada, 2019). A common way to build a high-performing question answering model is to fine-tune these pre-trained models on the entire training dataset, either via a span-extraction objective (Lan et al., 2020), (Clark et al., 2020b), (Bao et al., 2020) or a span-generation objective (Raffel et al., 2020). However, in this work, we explore a more challenging and practical setting where only a handful of annotated training and development samples are available. Related to this, (Ram et al., 2021) develop a new pre-trained model that uses a recurring span selection objective suitable for QA tasks. They then fine-tune this customized pre-trained model on downstream QA tasks using the standard span selection objective, arguing that the existing strategy of fine-tuning large language models fails in a few-shot QA setting. In contrast, we take existing pre-trained text-to-text models, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), and simply modify their fine-tuning objective to build a stronger few-shot QA model. As our solution relies only on fine-tuning modifications, we can easily extend the framework to larger models and multilingual settings without building a new pre-trained model each time. An alternative line of work for building question answering models in low-data settings involves dataset synthesis (Alberti et al., 2019), (Puri et al., 2020). One such approach generates synthetic (context, question, answer) triples by sampling context paragraphs from a large corpus of documents; answer spans are then generated, masked, and the resulting cloze-style text is used to generate natural questions.
To do this, they assume access to NLP tools such as a named entity recognizer and a part-of-speech tagger. Puri et al. (2020) use a mix of BERT-based answer generation, GPT-2-based (Radford et al., 2019) question generation and roundtrip filtration to train an extractive QA model. They show promising results with larger-scale models. However, the QA model is still fine-tuned on the entire dataset, and the entire process, including synthetic data generation, is computationally expensive. Our work deviates from these by not relying on additional synthetic data, not assuming access to external NLP tools, and using only a few training examples for fine-tuning. Our work also connects to recent developments in few-shot learning for classification tasks that cast the problem as a mask-filling problem (Schick and Schütze, 2021), (Gao et al., 2021), (Schick and Schütze, 2020). However, these solutions are geared towards classification tasks with a fixed set of classes, an assumption that does not hold for QA tasks.

Conclusion
We present an effective few-shot question answering (QA) system that combines pre-trained text-to-text models with a fine-tuning framework aligned with their pre-training counterpart. Through experimental studies on various QA benchmarks and few-shot configurations, we show that this system produces significant gains, including in scenarios where training data is extremely scarce (an absolute gain of 34 F1 points on average over the current standard fine-tuning framework). We also present extensions to multilingual and larger-model settings and show that the gains translate well to these settings (e.g., up to an absolute 40 F1 point gain over XLM-RoBERTa with a span-selection objective). Through ablation studies, we analyze the impact of model size, fine-tuning objectives and input-output design, and illustrate the factors leading to such strong gains. In the future, as our framework does not explicitly enforce the answer to be a span in the input text, it would be interesting to consider its applications to generative QA tasks.