Few-shot Unified Question Answering: Tuning Models or Prompts?

Question-answering (QA) tasks often investigate specific question types, knowledge domains, or reasoning skills, leading to specialized models catering to particular categories of QA tasks. While recent research has explored the idea of unified QA models, such models are usually explored in high-resource scenarios and require re-training to extend their capabilities. To overcome these drawbacks, this paper explores the potential of two tuning paradigms, model tuning and prompt tuning, for unified QA under a low-resource setting. We provide an exhaustive analysis of their applicability using 16 QA datasets, revealing that prompt tuning can perform as well as model tuning in a few-shot setting given a good initialization. The study also shows that parameter-sharing results in superior few-shot performance, that simple knowledge transfer techniques for prompt initialization can be effective, and that prompt tuning achieves a significant performance boost from pre-training in a low-resource regime. The research offers insights into the advantages and limitations of prompt tuning for unified QA in a few-shot setting, contributing to the development of effective and efficient systems in low-resource scenarios.


Introduction
Question answering (QA) is a pivotal area of research in NLP that evaluates the language understanding and reasoning capabilities of language models. To this end, the NLP community has developed numerous QA datasets that span various domains, question-answer formats, and reasoning skills (Rogers et al., 2022). Consequently, there is an increasing demand for a unified QA system that can manage mixed batches of instances from different datasets and tasks during training and inference (Liu et al., 2022). Such a system would eliminate the need for manual tuning or per-task adjustments, enabling seamless integration of new datasets. This would contribute to the development of efficient QA models with minimal computational and storage costs, enhanced generalization capabilities, and greater practicality for real-world use cases.
The success of transformer-based models in text-to-text generation has led to growing interest in unified QA systems. Khashabi et al. (2020) proposed UnifiedQA, a single QA model pre-trained on diverse datasets that outperforms format-specialized models. While prompt-tuning methods (Lester et al., 2021; Vu et al., 2022) have emerged as a promising alternative to fine-tuning, Zhong et al. (2022a) proposed to model the commonalities and distinguish task differences through a structurally designed prompt-based input schema. However, these approaches have limitations related to scalability, expensive pre-training requirements, and the need for tens of thousands of training examples per task. Moreover, the performance of pre-trained QA models degrades significantly when only a few question-answering examples are available (Ram et al., 2021). While unified QA approaches have shown success in high-data scenarios, their efficacy in more practical scenarios with limited training examples remains unexplored.
This paper explores the potential of two tuning paradigms, model tuning and prompt tuning, for unified question answering under a low-resource setting. Despite the importance of this problem, no previous studies have investigated the effectiveness of these paradigms for this task. In response, we conduct an exhaustive analysis of the applicability of the two paradigms to a unified question-answering system. To do so, we evaluate their promise, effectiveness, and trade-offs using a set of 16 QA datasets covering diverse domains and a wide range of skills and formats.
Our empirical study reveals several key findings: (i) prompt tuning with good initialization can outperform model tuning in a low-resource regime for out-of-distribution tasks; (ii) parameter-sharing results in superior few-shot performance, but the trends are reversed in the full-shot setting; (iii) simple knowledge transfer techniques for prompt initialization can be as effective as more complex methods in the few-shot setting, without introducing additional parameters; and (iv) prompt tuning achieves a significant performance boost from pre-training in a low-resource regime, while increasing model size does not significantly affect prompt tuning with initialization. In addition, we perform a systematic quantitative and qualitative study to provide insights into the advantages and limitations of prompt tuning for unified QA, with an emphasis on behaviors in the few-shot setting. Overall, our research aims to contribute to the development of effective and efficient unified question-answering systems in low-resource scenarios.

Related Work
Parameter-efficient tuning. Large-scale pre-trained language models fine-tuned on specific target datasets have shown remarkable performance on several downstream NLP tasks (Devlin et al., 2019; Liu et al., 2019; Raffel et al., 2022; Brown et al., 2020; He et al., 2021b; Lan et al., 2019; Yang et al., 2019). However, standard fine-tuning approaches update all the model parameters, which can lead to deployment challenges. Recent research (Houlsby et al., 2019; He et al., 2021c; Lester et al., 2021; Li and Liang, 2021a) has shown that similar performance can be obtained by updating or adding a small number of trainable parameters while keeping the pre-trained language model parameters frozen. Several approaches have been proposed in this direction: adapter-based methods (Houlsby et al., 2019; Mahabadi et al., 2021; Rücklé et al., 2021) insert small trainable feed-forward networks (modules) between layers of pre-trained language models, while BitFit (Ben Zaken et al., 2022) updates only the language model biases. Other computationally efficient approaches are prompt-tuning (Lester et al., 2021) and prefix-tuning (Li and Liang, 2021a), which concatenate trainable continuous embeddings to the input. These trainable parameters, called soft prompts, can be used as plug-ins with a frozen LM to capture task-specific, domain-specific, or language-specific knowledge. He et al. (2021a) present a unified view of different parameter-efficient tuning (PET) approaches.
Multi-task transfer learning. Efficient task transferability in NLP has been extensively studied (Wang et al., 2019; Liu et al., 2021a; Vu et al., 2020, 2021). With T5 (Raffel et al., 2022) demonstrating the capability of using existing downstream task datasets to learn a new task, proposing efficient methodologies for unifying NLP models has become a promising research direction in the community. Following this, Khashabi et al. (2020) proposed UnifiedQA, a single QA model pre-trained on datasets involving diverse formats and reasoning skills. Transfer learning has been demonstrated to be effective from rich data sources (Phang et al., 2018), between similar target tasks (Vu et al., 2020), and for tasks that require similar reasoning skills (Pruksachatkun et al., 2020). However, this approach requires updating or retraining the model on a new task or a different domain, which can lead to catastrophic forgetting (Kirkpatrick et al., 2017). Moreover, Aghajanyan et al. (2021) showed that approaches to unifying NLP models suffer from negative interference for less-represented tasks and between dissimilar tasks.
Most recently, Liu et al. (2022) validate that parameter-efficient tuning methods can perform well with mixed task batches. Zhong et al. (2022b) take a first step toward building unified QA models using structural prompt tuning. Along these lines, Vu et al. (2022) and Asai et al. (2022) integrate the paradigms of parameter-efficient tuning and unified NLP models, proposing a single pre-trained model for different downstream tasks by learning target task-specific prompts from source task prompts. Asai et al. (2022) demonstrate transfer using an attention module, while Vu et al. (2022) facilitate prompt transfer by initializing the target prompt from similar source prompts. These approaches require fewer than 0.1% of trainable LM parameters with little trade-off in performance.
Few-shot question answering. Ram et al. (2021) identified a discrepancy between current pre-training objectives and QA, as standard models perform poorly when fine-tuned with few examples. They propose recurring span selection as a pre-training scheme tailored to question answering. Chada and Natarajan (2021), on the other hand, propose a fine-tuning framework aligned with the pre-training objective. However, no studies have focused on the viability of prompt tuning for unified QA in low-resource settings. To address this gap, we follow prior work (Liu et al., 2022; Asai et al., 2022; Khashabi et al., 2020) and extensively study the viability and trade-offs of prompt tuning and prompt-based transfer learning in comparison to full-model fine-tuning for few-shot unified QA. Based on our comprehensive experiments, we offer essential guidelines in the form of insights into the advantages and limitations of prompt tuning relative to model tuning for unified QA in both full-shot and few-shot scenarios.
Candidates for a Universal QA Approach

Fine-tuning pre-trained language models (FT) on specific datasets yields specialized models that cater to individual tasks. However, a more efficient approach is to build a unified QA model that can perform multiple tasks without manual tuning or per-task adjustments. One significant advantage of such approaches is that they seamlessly support mixed-task batch inference (Liu et al., 2022), where a single model handles diverse tasks, reducing computation, storage, and maintenance costs. This study assesses the suitability of two prevalent NLP training paradigms, model-tuning and prompt-tuning, as potential approaches for developing a unified question-answering (QA) model. Our investigation centers around four essential criteria for an effective unified QA model: (1) the ability to use a single model to address a range of different QA tasks, (2) effective knowledge transfer from multiple relevant tasks, (3) minimal risk of negative interference, and (4) extensibility to new tasks without requiring expensive retraining. Our goal is to investigate the potential of soft prompt-tuning extensively and to better understand its benefits and drawbacks in comparison with model-tuning-based approaches for building a unified QA system grounded in these four principles. In particular, we center the study on understanding these trade-offs in few-shot learning scenarios, a realistic and more practical challenge.
Model-tuning This paradigm involves fine-tuning all the parameters of a language model to cater to a specific task or set of tasks. Although fine-tuning (FT) on a particular dataset is an effective strategy, it is not suitable for unified QA because it requires a specialized model for each dataset during inference, which runs counter to the concept of a unified QA model.
In contrast, multi-task learning via fine-tuning (FT-MT) (Raffel et al., 2022; Aribandi et al., 2021) jointly trains a single model on multiple datasets by sharing all trainable model parameters across tasks. By training on multiple datasets, FT-MT allows knowledge transfer from relevant tasks during inference. However, sharing all parameters often leads to negative transfer from unrelated tasks. Incorporating additional tasks into an existing model requires retraining on all previous tasks plus the new ones, making this approach computationally expensive to scale and more prone to negative interference.
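To make the contrast concrete, the sketch below illustrates (under our own naming, not the authors' released code) how FT-MT-style training pools examples from all datasets into mixed-task batches, so that every parameter update is shared across tasks:

```python
import random

def multitask_batches(datasets, batch_size=16):
    """Illustrative FT-MT data mixing: examples from all QA datasets,
    already converted to a shared text-to-text format, are pooled so
    that a single batch may contain instances from several tasks."""
    pool = [ex for examples in datasets.values() for ex in examples]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```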
Prompt-tuning This paradigm learns soft-prompt tokens added to the input while the backbone language model remains frozen. We follow the approach proposed by Lester et al. (2021) to train soft prompts for each task, where prompts are initialized from random words in the vocabulary (PT-R). This vanilla prompt-tuning approach is parameter-efficient and easy to scale. Since task-specific knowledge is captured in a separate set of parameters (i.e., the prompts), this approach largely avoids negative interference. With a single backbone model, we can use these prompts for different tasks. However, it does not leverage knowledge from other tasks beyond what is already captured in the backbone model.
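As a concrete illustration, the following PyTorch-style sketch shows the essence of vanilla prompt tuning as used here: k trainable embeddings, initialized from random vocabulary words, are prepended to the input of a frozen backbone. The class and variable names are ours, not the original implementation of Lester et al. (2021):

```python
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Minimal sketch of vanilla prompt tuning (PT-R): k trainable prompt
    vectors are prepended to the input embeddings of a frozen LM."""

    def __init__(self, backbone, k=100):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze all LM parameters
            p.requires_grad = False
        vocab = backbone.get_input_embeddings().weight       # (V, d)
        idx = torch.randint(0, vocab.size(0), (k,))          # random words
        self.prompt = nn.Parameter(vocab[idx].detach().clone())  # (k, d)

    def forward(self, input_ids, attention_mask, labels=None):
        embeds = self.backbone.get_input_embeddings()(input_ids)  # (B, L, d)
        b = embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(b, -1, -1)       # (B, k, d)
        mask = torch.cat([torch.ones(b, prompt.size(1),
                                     device=attention_mask.device,
                                     dtype=attention_mask.dtype),
                          attention_mask], dim=1)
        return self.backbone(inputs_embeds=torch.cat([prompt, embeds], dim=1),
                             attention_mask=mask, labels=labels)
```

Only `self.prompt` receives gradients, so a single frozen backbone can serve many tasks by swapping prompts.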
Prompt initialization addresses the lack of knowledge transfer from source tasks in vanilla prompt-tuning while retaining the benefits of a single model, minimal negative transfer, and extensibility. Previous studies (Li and Liang, 2021b; Liu et al., 2023; Vu et al., 2022) have shown that prompt-tuning methods are often sensitive to initialization, particularly in low-data settings. However, the impact of different initialization methods on QA datasets has not been well studied. Inspired by Vu et al. (2022), we initialize the target prompt with the average of the top-3 source task prompts most similar to a prompt trained on the target dataset. We employ two distinct approaches to this initialization: (i) selecting source task prompts with the same answer format as the target dataset (PT-F), and (ii) selecting source task prompts from the complete set of source prompts (PT-C).
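A minimal sketch of this initialization follows, assuming each prompt is a (k, d) tensor and using cosine similarity as in Vu et al. (2022); the helper name is ours:

```python
import torch
import torch.nn.functional as F

def init_from_sources(target_prompt, source_prompts, top_k=3):
    """Average the top-k source prompts most similar (by cosine
    similarity of flattened prompts) to a prompt trained on the target.
    For PT-F, `source_prompts` is pre-filtered to source tasks sharing
    the target's answer format; for PT-C it is the complete prompt pool."""
    t = target_prompt.flatten()
    scores = {name: F.cosine_similarity(t, p.flatten(), dim=0).item()
              for name, p in source_prompts.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return torch.stack([source_prompts[n] for n in top]).mean(dim=0)
```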
Apart from prompt initialization, another way to transfer knowledge from multiple tasks is to compose their corresponding prompts. To this end, Asai et al. (2022) propose ATTEMPT, a transfer learning method that learns new task-specific target prompts by computing weighted combinations of source prompts using a sub-network-based attention module trained on a single task or a set of tasks. We distinguish between two settings: ATT-MT, where attention modules are shared across tasks and trained in a multi-task manner, and ATT-ST, where attention module parameters are not shared. While ATT-MT provides a single model for transferring knowledge from source prompts and is easily scalable to new target tasks, sharing attention modules across tasks may result in some negative transfer compared to more straightforward prompt-tuning methods.
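The sketch below conveys the core idea of this attention-based composition; it simplifies the sub-network and pooling details of Asai et al. (2022), and all names are illustrative:

```python
import torch
import torch.nn as nn

class PromptComposer(nn.Module):
    """Simplified ATTEMPT-style composition: an attention sub-network
    scores frozen source prompts against the input, and the instance-wise
    target prompt is their weighted mixture plus a task-specific prompt.
    In ATT-MT this module is shared and trained across target tasks."""

    def __init__(self, source_prompts, k=100, d=768, bottleneck=64):
        super().__init__()
        self.sources = nn.Parameter(source_prompts, requires_grad=False)  # (S, k, d)
        self.task_prompt = nn.Parameter(torch.randn(k, d) * 0.02)
        self.query_net = nn.Sequential(
            nn.Linear(d, bottleneck), nn.SiLU(), nn.Linear(bottleneck, d))

    def forward(self, input_embeds):                      # (B, L, d)
        query = self.query_net(input_embeds.mean(dim=1))  # (B, d)
        keys = self.sources.mean(dim=1)                   # (S, d)
        weights = torch.softmax(query @ keys.T, dim=-1)   # (B, S)
        mixed = torch.einsum("bs,skd->bkd", weights, self.sources)
        return mixed + self.task_prompt                   # (B, k, d)
```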

Datasets
In their recent study, Rogers et al. (2022) highlight a significant increase in the number of question-answering and reading comprehension datasets spanning various domains, formats, and reasoning abilities. This study evaluates and fine-tunes a range of models, leveraging a collection of datasets referred to as "source datasets" for pre-training and a distinct set of "target datasets" for evaluation. We include datasets that cover a wide range of reasoning skills and complex linguistic phenomena, including conversational, temporal, causal, and coreference reasoning, among others. This broad coverage across reasoning skills not only enables a more thorough evaluation of training paradigms on QA datasets but also facilitates analysis of cross-skill transfer. Tables 3 and 4 present an overview of the datasets employed in our study, detailing their size, domain, and associated primary reasoning skill.
Source Datasets. This study leverages source datasets for two primary purposes: pre-training models through model tuning and training source prompts via prompt-tuning approaches. The source datasets employed in our research comprise over 30,000 training instances. They aim to cover essential reasoning skills such as reading comprehension, conversational and commonsense reasoning, as well as the discrete and numerical reasoning necessary for question answering. Source datasets span a wide range of domains, including knowledge bases, news, web documents, and Wikipedia.
Target Datasets. We employ target datasets to fine-tune models under the model-tuning paradigm, or to train target prompts for prompt-tuning approaches. Target datasets are typically small, containing fewer training instances. In-Distribution: includes datasets that share the same domain and reasoning skills as one or more of the source datasets. Examples include MCTest and BoolQ, which cover generic domains and reasoning skills such as Wikipedia reading comprehension. Out-of-Distribution: includes datasets whose domain knowledge and reasoning skills are not found in the source datasets. This subset includes datasets that involve intricate and specialized reasoning abilities such as temporal commonsense, causal reasoning, and logical and inferential reasoning, as well as datasets specific to domains like Twitter, TOEFL, law books, and personal narratives. These target datasets can still benefit from generic source tasks. In some contexts, certain tasks require multiple types of reasoning. For instance, the ShARC dataset necessitates a combination of conversational and causal reasoning, while the COPA dataset entails commonsense causal reasoning. Natural language processing models may therefore face additional challenges on these tasks due to the integration of multiple reasoning skills. To assess the effectiveness of a unified QA system, we perform experiments on the test set of the target datasets.

Experiments
We employ the T5-base model for all our experiments unless stated otherwise. Source prompts are trained independently for each task, while the pre-trained language model (PrLM) and attention modules for ATTEMPT are trained jointly on all the source tasks. For target datasets, we randomly select a small number of instances for few-shot training and evaluation. The training hyperparameters are presented in the following subsection, and Table 7 details the initialization used for different target tasks in both PT-F and PT-C. We select the best checkpoint based on validation set performance, with FT-MT and ATT-MT using a single validation set comprising all the target tasks, and PT-R, PT-F, and PT-C using a separate validation set for each target task. We evaluate the best checkpoint on the test set of each target dataset, using F1 as the metric for extractive and abstractive QA datasets and accuracy for MCQ and yes/no QA datasets. In cases where a test set is unavailable, we use the development set to report our model's performance and create a small subset from the training set for hyperparameter tuning and checkpoint selection. We report the aggregate results of three seeds. Table 5 summarizes the experimental results comparing the model-tuning and prompt-tuning paradigms for a unified QA system. In the rest of this section, we share our key findings and insights that can help guide which paradigm to prefer under which scenarios.

Hyper-parameters
After extensive tuning, we selected a learning rate of 1e-5 for the backbone model, along with a maximum source length of 512, a gradient accumulation step of 2, and a batch size of 16. During training, we saved and evaluated checkpoints every 500 steps and trained for up to 100K steps with patience-based early stopping. For all experiments, the prompts consisted of k = 100 tokens with a hidden dimension of d = 768.
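For reference, the values above translate into a configuration along these lines (field names are hypothetical, not from a released codebase):

```python
# Illustrative configuration mirroring the reported hyperparameters.
TRAIN_CONFIG = {
    "learning_rate": 1e-5,              # backbone learning rate
    "max_source_length": 512,
    "gradient_accumulation_steps": 2,
    "batch_size": 16,
    "eval_and_save_steps": 500,         # checkpoint frequency
    "max_steps": 100_000,               # with patience-based early stopping
    "prompt_tokens": 100,               # k soft-prompt tokens
    "prompt_hidden_dim": 768,           # d, matching T5-base
}
```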
Best candidate for unified QA. FT-MT, PT-R, PT-F, PT-C, and ATT-MT are potential candidates for the unified question answering task. In low-resource scenarios, all candidates perform similarly, but PT-F and PT-C stand out due to their low number of trainable parameters and ease of scaling to new datasets. As the number of training instances increases, FT-MT outperforms the other approaches, while prompt-tuning approaches remain competitive. Our findings suggest that a simple approach like PT-F is on par with more sophisticated prompt-transfer learning approaches like ATT-MT.

Format-based prompt initialization achieves comparable performance to more complex prompt-transfer approaches. The prompt-tuning paradigm has emerged as a highly effective approach for adapting pre-trained language models to specific tasks. However, its success can be highly sensitive to initialization. To address this issue, we drew inspiration from Vu et al. (2022) and explored two initialization techniques for the target prompt (PT-F and PT-C). Both initialization techniques outperformed random initialization by 6% with 32 examples, and this gap increased to approximately 20% with 1024 examples. Notably, the simpler format-based heuristic initialization was just as effective as the more complex cosine-similarity-based search over the entire prompt pool. Furthermore, both prompt initialization approaches were competitive with the sophisticated attention-module-based prompt-transfer learning approach ATT-MT.
Our analysis further reveals that the performance of PT-F and PT-C varies with the skill or domain of the dataset (see Table 10). Evaluation on datasets from specific domains (Figure 9) shows that in low-resource scenarios, PT-F outperforms PT-C in the Web+Social and domain-specific book domains, while PT-C is more effective for knowledge-graph and Wikipedia domains. In high-resource scenarios, all models perform similarly. From a skill perspective (Figure 10), PT-F performs better on dialog reasoning in the low range and on commonsense reasoning in the high range, whereas PT-C is better suited for causal reasoning in the low range. More detailed results can be found in Table 10 in the appendix. We find that PT-C and PT-F have the highest agreement scores, which we partly attribute to the high overlap between the initialization prompts chosen by PT-F and PT-C for format-wise and logically similar tasks. However, as the number of shots increases, the overall agreement across different modes decreases. Furthermore, we investigate whether different modes are complementary by evaluating the union of their predictions across different shots. We find that fine-tuning (FT) and multi-task fine-tuning (FT-MT) are complementary in low-resource settings, whereas the gains from adding PT-R to other modes are minimal (for complete results, see Figure 3 in the appendix). This might indicate that prompt tuning is not practical without good initialization in extremely low-resource QA scenarios. For further discussion of the few-shot analysis, refer to the appendix.
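The complementarity analysis can be summarized by a small helper like the following (our own sketch): an example counts as solved by the union of two modes if either mode answers it correctly.

```python
def union_score(preds_a, preds_b, gold):
    """Fraction of examples answered correctly by at least one of the
    two modes, used to gauge how complementary their predictions are."""
    hits = sum((a == g) or (b == g)
               for a, b, g in zip(preds_a, preds_b, gold))
    return hits / len(gold)
```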
A closer look at task-level performance across different few-shot settings reveals counter-intuitive behaviors. Under low-resource settings (< 256 shots), good initialization helps significantly for target tasks that are similar to source tasks (e.g., OBQA, BoolQ, IIRC), and the performance gain decreases as the number of shots increases. As seen in Figure 8, for similar tasks PT-C and model tuning (FT-MT) perform significantly better than PT-R. However, in cases with little domain overlap (ShARC), initialization does not contribute substantially to overall performance. Interestingly, in some cases we observe counter-intuitive results where performance remains flat across shots (ShARC, Figure 8) or follows a zig-zag pattern (ROPES). We point the reader to the Appendix (Figures 4, 5, and 8) for performance across different modes and shot counts.

Qualitative Study
Table 8 presents a few qualitative examples across different shots and modes. We find that prompt tuning with good initialization leverages world knowledge better (e.g., associating the Arctic Circle with cold weather) even in low-resource settings, while prompt tuning struggles on tasks requiring local, context-based reasoning (e.g., taking photos of a home does not imply a new home).
Does the same model agree with itself across different few-shot settings? Figures 4 and 5 present the overall agreement of a model on a single task under different shot settings. We observe high agreement between adjacent shot counts that gradually decreases as the number of shots grows, both for fine-tuning and for prompt tuning with initialization. However, prompt tuning with random initialization shows roughly 50% agreement across different shots, with no clear pattern of higher agreement between adjacent shot counts as found in the other settings.
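The agreement numbers discussed above can be computed with a simple pairwise comparison over aligned predictions, sketched here under our own naming:

```python
from itertools import combinations

def shot_agreement(predictions):
    """Given a dict mapping shot count -> list of predictions for one
    task (aligned by example), return the fraction of examples on which
    each pair of shot settings produces the same answer."""
    agree = {}
    for a, b in combinations(sorted(predictions), 2):
        same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
        agree[(a, b)] = same / len(predictions[a])
    return agree
```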

Conclusion
In this work, we explore the viability of prompt-tuning as a solution to unified QA and conduct a thorough analysis of its promise, effectiveness, and trade-offs compared with the model-tuning paradigm on a set of 16 QA datasets, focusing particularly on several few-shot scenarios. We obtain several key findings and insights that we hope will inform which paradigm to prefer under which scenarios. Prompt tuning is quite competitive with model-tuning at the lower extreme of the few-shot scenarios, given a good initialization.
While parameter-sharing leads to superior performance in the few-shot setting, the trends flip in the full-shot setting. A simple knowledge transfer approach (i.e., an average of relevant prompts) is as effective as complex methods without introducing additional parameters. Pre-training the backbone model on the source tasks significantly benefits prompt tuning. While initializing from a strong prior is very helpful for prompt tuning, its benefit is not as substantial with a larger backbone model, especially when the number of training examples exceeds a certain threshold.

Limitations
Our work has several limitations: (1) since few-shot experiments are prone to considerable variance due to the randomly sampled training examples, we repeat all experiments using three random seeds for the T5-base backbone. However, since the number of experiments per seed exceeds 1500, we were only able to run the same experiments with a T5-large backbone using a single seed and excluding certain few-shot settings due to computational limitations, especially given that the latter model has 3.5 times more parameters. Although our comparisons of the two models are presented in an entirely fair fashion using the same single seed, it would have been more conclusive to test our findings from the T5-base backbone on the larger model to the same extent. That is also why the current version of our study does not include comparisons with even larger models such as T5-3B or T5-11B.
(2) We explore a limited number of prompt-tuning methods, both in terms of how the soft prompts are injected into the model architecture, following Lester et al. (2021), and how the knowledge from source tasks is used to inform target tasks, following Vu et al. (2022) and Asai et al. (2022). For example, Liu et al. (2022) propose a parameter-efficient fine-tuning alternative to soft prompt-tuning, while Zhong et al. (2022a) show the benefits of prompt-based pre-training. Although the key takeaways in the current version of our study are supported by sufficient empirical evidence, incorporating these recent developments may provide further promise and evidence for prompt-based approaches to few-shot unified QA.
(3) Our study is currently limited to English QA datasets, which prevents our findings from being generally valid for cross-lingual and/or cross-modal question-answering systems. Future work should consider how our findings generalize to other languages and modalities.

Ethical Statement
We observe a preference for multiple-choice question (MCQ) answer formats across various question-answering (QA) datasets with varying levels of reasoning ability. Additionally, the majority of the source datasets were drawn from Wikipedia, which may contain gender or political bias that models could further perpetuate. The T5 model used for pre-training may also carry biases from its pre-training data. However, this study did not conduct stress tests to identify potential biases, and users should be cautious when deploying the provided models.
The current models' outputs may not align with the facts in input documents, potentially leading to the spread of false information online. This is a common issue across current QA models, and further research is needed in this area. Our experiments were primarily conducted on A100 GPUs and consumed a significant amount of GPU time when repeated across random seeds. Nevertheless, our findings can benefit subsequent studies and applications by providing valuable insights, thus avoiding the need for extensive repetition of these comparisons.

A Appendix
Effect of Model Size. Recent studies have shown that the performance gap between prompt-tuning and fine-tuning narrows as model size increases (Liu et al., 2021b). In this work, we compare the base and large variants of T5 for a range of tuning methods, as shown in Table 9. Unless otherwise specified, we use the T5-base model for our experimentation. We observe a consistent improvement in performance with larger language models. Specifically, model-tuning approaches achieve a consistent improvement of approximately 10 points across 32 to 1024 training instances, while prompt-tuning without initialization achieves an improvement of roughly 6 points. However, prompt-tuning with initialization and ATTEMPT do not show significant improvement with larger models, and this improvement diminishes as the number of training instances increases. The limited performance gain from large models leads us to conclude that multi-task model-tuning outperforms prompt-tuning and

Figure 1: Comparison of multi-task model-tuning (FT-MT), prompt-tuning with a format-based (PT-F) and a format-agnostic (PT-C) initialization, and ATTEMPT (ATT-MT) (Asai et al., 2022), a complex prompt transfer learning approach, for unified QA on target QA datasets in a 32-shot scenario using T5-Base as the backbone model. The results show that prompt-tuning with a prior outperforms multi-task full-model fine-tuning, and that ATTEMPT does not provide any additional advantage, especially for out-of-domain target tasks. IID refers to in-distribution and OOD to out-of-distribution.
Do different models agree on their answers? Figure 7 shows the average agreement of different models on all the tasks across different few-shot scenarios.
Effect of Pre-training. Pre-training improves performance in few-shot scenarios, particularly in the lower range, with significant benefits observed in prompt-tuning. Following UnifiedQA (Khashabi et al., 2020), we observe that pre-training the T5-base model on diverse source datasets with varying formats and skill requirements (as shown in Table 2) can boost the performance of the pre-trained language model (PrLM) in both fine-tuning and prompt-tuning scenarios. Our analysis reveals that pre-training can yield substantial performance gains through knowledge transfer from source tasks, especially when few training examples are available (refer to Figure 1). We further observe that prompt-tuning with a pre-trained LM introduces an inductive bias in the prompts, resulting in a much greater performance boost than for FT-MT, with the difference becoming more pronounced as the number of instances increases (potentially due to overfitting). Specifically, the improvement for PT-R changes from 36% to 24% as the number of training instances increases from 16 to 1024, while the improvement for FT-MT drops sharply from 27% to 7%. We note that ATT-MT follows a similar pattern to model tuning (MT). Moreover, our findings indicate that datasets such as COSMOSQA, OBQA, DREAM, MCTest, IIRC, and BoolQ exhibit substantial performance gains from pre-training, likely due to their similarity to some of the source datasets. On the other hand, datasets such as McTACO, QuaRel, ShARC, and PIQA, which are less closely related to the source datasets, do not exhibit significant improvements with pre-training.
Figure 3: Heatmaps showing union matrix of different shots for each mode

Figure 4: Graphs showing task-level agreement across different shots for different modes

Figure 5: Graphs showing task-level agreement across different shots for different modes

Figure 7: Heatmaps showing the agreement matrix of different modes under different few-shot settings

Figure 9: Comparison of FT-MT, PT-F, PT-C, and ATT-MT in several few-shot scenarios using T5-Base as the backbone model for different domains

Figure 10: Comparison of FT-MT, PT-F, PT-C, and ATT-MT in several few-shot scenarios using T5-Base as the backbone model for different reasoning skills

Table 2: Question answering (QA) datasets used as source and target datasets in this study. For each dataset, the table provides details on associated reasoning skills, domain, and the number of training examples available. RC stands for reading comprehension.

Table 3: Question answering (QA) datasets used as source datasets in this study. For each dataset, the table provides details on associated reasoning skills, domain, and question format, including extractive (EXT), abstractive (ABS), and multiple-choice (MCQ) questions.

Table 4: Question answering (QA) datasets used as target datasets in this study. For each dataset, the table provides details on associated reasoning skills, domain, and question format, including extractive (EXT), abstractive (ABS), and multiple-choice (MCQ) questions.

Table 5: Comparison of model-tuning and prompt-tuning paradigms in the few-shot setting. Model-tuning approaches include FT and FT-MT, while PT-R represents vanilla prompt tuning and PT-F and PT-C correspond to prompt tuning with initialization. ATT-ST and ATT-MT are single-task and multi-task variants of ATTEMPT, a prompt transfer learning approach. Bold values indicate the best model with a T5-base backbone for each k-shot scenario, while underline marks the second-best.

Table 7: Source prompts most similar to target prompts for format-based and complete-set initialization, corresponding to PT-F and PT-C respectively. Bold indicates source tasks common to both partitions. Although some source prompts are shared across target tasks, Quoref and COPA have none in common.

Table 8: Qualitative examples showing model predictions across different shots for different tasks. The few-shot column shows the shot count up to which the predictions in the table hold.

Table 10: Categorization of target datasets based on domain and reasoning skill.

Table 12: Aggregate standard deviation of target tasks across different seeds. Increasing the number of training instances reduces the standard deviation, improving model robustness and reducing sensitivity to minor variations. Pre-training the LM (PrLM) reduces the standard deviation across all approaches, leading to stable performance and better generalization while addressing overfitting. Prompt tuning has a higher deviation due to initialization sensitivity. Parameter-sharing and prompt initialization techniques reduce deviation by leveraging knowledge from other tasks, yielding stable performance, especially in low-resource scenarios, and mitigating overfitting.