Bayesian Multi-Task Transfer Learning for Soft Prompt Tuning

Prompt tuning, in which prompts are optimized to adapt large-scale pre-trained language models to downstream tasks instead of fine-tuning the full model parameters, has been shown to be particularly effective when the prompts are trained in a multi-task transfer learning setting. These methods generally train prompts individually for each source task and then aggregate them to provide an initialization of the prompt for the target task. However, this approach critically ignores the fact that the source tasks may interfere with each other, positively or negatively. We argue that when we extract knowledge from source tasks via training source prompts, we should consider this correlation among source tasks for better transfer to target tasks. To this end, we propose a Bayesian approach in which we work with the posterior distribution of prompts across source tasks. We obtain representative source prompts corresponding to samples from the posterior using Stein Variational Gradient Descent, which are then aggregated to constitute the initial target prompt. We show extensive experimental results on standard benchmark NLP tasks, where our Bayesian multi-task transfer learning approach outperforms state-of-the-art methods in many settings. Furthermore, our approach requires no auxiliary models other than the prompt itself, achieving a high degree of parameter efficiency.


Introduction
Large-scale pre-trained language models (PLMs) have recently been fine-tuned for various NLP tasks (Devlin et al., 2019; Raffel et al., 2020a). Due to the computational challenges of training the extensive parameters in PLMs, there is a growing focus on methods that efficiently tune fewer parameters (Houlsby et al., 2019; Ben Zaken et al., 2022).
Figure 1: Two key steps of Bayesian Multi-Task Prompt Tuning (BMTPT) are illustrated. First, we merge the posterior distributions of the source tasks to form a global posterior distribution. This distribution is approximated using Stein Variational Gradient Descent (SVGD), a particle-based variational inference method. Finally, we adapt to the target task by using the derived source posterior as a prior. Black and red arrowed lines denote prior works and BMTPT, respectively.
One of the most promising approaches is prompt tuning (PT, Lester et al. 2021), where a few adaptable vectors are added as prompts to the input of the downstream task (Lester et al., 2021; Li and Liang, 2021). PT freezes the PLM parameters and limits learning to the prompts, yet achieves impressive performance. However, it remains challenging to match the performance of full fine-tuning and to mitigate sensitivity to initialization (Zhong et al., 2022).
To address these challenges, recent works (Wang et al., 2023; Asai et al., 2022; Vu et al., 2022) proposed to adopt a multi-task transfer learning approach, where the prompt is trained on multiple source tasks and then applied to the target task. Specifically, they train real-valued vectors for prompts (i.e., soft prompts) on source tasks and use them as the initialization of the prompt for the target task. However, it is unclear whether aggregating such individually trained prompts provides a reliable initialization point and fully harnesses the benefits of multi-task transfer learning.
In this paper, we propose Bayesian Multi-Task Prompt Tuning (BMTPT) as a practical yet effective solution to this challenge. Unlike traditional prompt tuning methods grounded in transfer learning, our approach works with the posterior distribution of prompts across a multitude of source tasks. To transfer the knowledge gleaned from the source tasks, we use the source prompts' posterior distribution as the prior for the target task. This Bayesian treatment augments the conventional transfer learning framework, which primarily learns an initialization point for the target prompt from the source tasks. Specifically, BMTPT employs Stein Variational Gradient Descent (SVGD, Liu and Wang 2016), a particle-based Variational Inference (VI) method, to approximate the source prompts' posterior distribution. Further elaboration on this method is provided in Section 2.2.
We validate our approach through experiments on 21 datasets across diverse NLP tasks and output formats. The experimental results demonstrate that BMTPT achieves comparable or superior performance to strong state-of-the-art parameter-efficient fine-tuning methods (Asai et al., 2022; Wang et al., 2023) as well as full fine-tuning, while using a very small number of parameters and requiring no auxiliary models other than the prompt itself.

Background

Transfer Learning for Prompt Tuning
Fine-tuning entire models for downstream NLP tasks, particularly with a Large Language Model (LLM), can be expensive in terms of training costs. Parameter-efficient tuning therefore focuses on limiting updates to a small set of parameters. Various approaches have been proposed, such as Adapter (Houlsby et al., 2019) and its variants (Karimi Mahabadi et al., 2021a; Hu et al., 2022), which insert trainable layers, and BitFit (Ben Zaken et al., 2022), which trains only bias weights while keeping the other weights intact.
Recently, there has been growing interest in prompt tuning (PT). This approach updates only the 'soft prompt', a set of continuous vectors prepended to the input. We can formally describe PT as follows: consider an input sequence x and a soft prompt θ ∈ R^{l×d} with length l and dimension d, matching the language model's (LM) embedding dimension. The soft prompt is prepended to the sequence x and then processed by the LM, resulting in the prediction of the target sequence y.
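As a concrete illustration (a minimal sketch with made-up sizes; the actual embedding dimension depends on the backbone LM), prepending a soft prompt amounts to concatenating a trainable (l, d) matrix with the frozen embeddings of x:

```python
import numpy as np

# Hypothetical sizes: prompt length l, LM embedding dim d, input length n.
l, d, n = 100, 768, 24

rng = np.random.default_rng(0)
soft_prompt = rng.normal(size=(l, d))   # trainable parameters theta
token_embeds = rng.normal(size=(n, d))  # frozen LM embeddings of input x

# The prompt is prepended to the embedded sequence before the LM processes it;
# only soft_prompt receives gradient updates.
lm_input = np.concatenate([soft_prompt, token_embeds], axis=0)
assert lm_input.shape == (l + n, d)
```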
Our work aligns closely with recent efforts to transfer soft prompts from source tasks in order to initialize prompts for target tasks. For instance, SPoT (Vu et al., 2022) retrieves a source task prompt based on similarity to initialize the target task prompt, while ATTEMPT (Asai et al., 2022) employs an attention mechanism to initialize the target task prompt using information from the source prompts. The most recent method for prompt tuning transfer, MPT (Wang et al., 2023), decomposes source prompts into shared and task-specific parts to reduce interference between tasks during aggregation.
However, these strategies may not fully address the inherent heterogeneity within source task distributions, and they can falter when attempting to aggregate prompts trained across various tasks. Notably, these issues persist even when each source task's posterior distribution is Gaussian. Further discussion on this subject can be found in Appendix A.
Hence, an integrated approach to the source task distributions may prove advantageous if a representative knowledge set can be constituted for transfer to the target task. This paper takes a Bayesian approach to transferring prompts from source tasks to target tasks. Instead of learning prompts individually and then aggregating them, we use the full posterior distribution of prompts across the source tasks. Since this is intractable, we approximate the posterior via sampling and leverage these samples for training the prompt for the target task, which corresponds to setting the posterior as the prior of the target prompt.

Particle-Based VI and SVGD
Variational Inference (VI) is a widely used approach in machine learning for distribution approximation, notably in Bayesian Neural Networks (BNNs) (Blundell et al., 2015; Graves, 2011). Despite its computational simplicity, it often restricts the family of distributions, a limitation that is less present in methods like MCMC (Gilks et al., 1995; Doucet et al., 2001; Robert and Casella, 2004).
Particle-based VI methods provide an alternative by drawing upon the strengths of both VI and MCMC. Unlike traditional VI methods, particle-based VI does not restrict itself to a specific family of distributions; this flexibility allows it to approximate a wider range of complex and diverse distributions (Liu and Wang, 2016; Zhang et al., 2020; Naesseth et al., 2018). However, its theoretical guarantees are not yet fully understood: assumptions often made, such as the presence of infinitely many particles or adherence to simple distributions like the Gaussian, may not hold in practical scenarios (Naesseth et al., 2018; Salim et al., 2022; Sun et al., 2022; Liu et al., 2023).
Stein Variational Gradient Descent (SVGD, Liu and Wang 2016) is a significant advancement in particle-based VI. SVGD iteratively transforms a set of particles so that they come to behave as though they were samples drawn from the target distribution p. For particles {θ_i}_{i=1}^M, the SVGD update rule is

θ_i ← θ_i + (α/M) Σ_{j=1}^{M} [ k(θ_j, θ_i) ∇_{θ_j} log p(θ_j) + ∇_{θ_j} k(θ_j, θ_i) ],

where k(·, ·) is a positive definite kernel function, such as the RBF kernel, and α is the learning rate.
Despite its merits, SVGD can face mode collapse (Chen and Ghattas, 2020; Liu et al., 2022). One workaround, Damped SVGD (Ba et al., 2022), mitigates this by adjusting the deterministic bias in the update rule, and is used in our work. For a more thorough mathematical explanation, kernel details, and information about damped SVGD, we direct readers to Appendix B.
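To make the update concrete, here is a minimal NumPy sketch of vanilla SVGD with an RBF kernel and the median bandwidth heuristic (illustrative only: the step size, particle count, and toy Gaussian target are our own choices, and the damped variant is not shown):

```python
import numpy as np

def svgd_step(particles, grad_log_p, alpha=0.1):
    """One SVGD update. The first (driving) term shares score information
    among particles via the kernel; the second (repulsive) term pushes
    particles apart so they do not collapse onto a single mode."""
    M = len(particles)
    diff = particles[:, None, :] - particles[None, :, :]   # x_i - x_j, shape (M, M, D)
    sq = (diff ** 2).sum(axis=-1)                          # pairwise squared distances
    h = np.median(sq) / np.log(M + 1) + 1e-8               # median heuristic bandwidth
    K = np.exp(-sq / h)                                    # RBF kernel k(x_j, x_i)
    drive = K @ grad_log_p(particles)                      # sum_j k(x_j, x_i) grad log p(x_j)
    repel = (2.0 / h) * (K[..., None] * diff).sum(axis=1)  # sum_j grad_{x_j} k(x_j, x_i)
    return particles + alpha * (drive + repel) / M

# Toy target: standard 2-D Gaussian, so grad log p(x) = -x.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(20, 2))   # particles start far from the mode
for _ in range(1000):
    x = svgd_step(x, lambda p: -p)
# x now approximates samples from N(0, I): mean near 0, particles spread apart.
```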

Problem Setting
In this section, we formally introduce the core elements, symbols, and problem statements that form the basis of our approach. We denote the trainable parameter of the soft prompt as θ ∈ R^{l×d}, characterized by its length l and the dimension d of the Language Model (LM). For clarity, we use θ^S and θ^T to denote the soft prompts for source tasks and target task(s), respectively. The soft prompt θ is prepended to the sequence x prior to its processing by the LM, with the underlying objective of predicting the target sequence y. We denote the dataset for the k-th source task as D^S_k, and define D^S ≡ {D^S_k}_{k=1}^K; the target task dataset is denoted D^T during task adaptation. Thus, the n-th instance in D^S_k is represented as (x^k_n, y^k_n). The log-likelihood log p(D^S_k | θ^S) for the k-th source task can be written as

log p(D^S_k | θ^S) = Σ_n log p_LM(y^k_n | [θ^S; x^k_n]),

where p_LM denotes the likelihood determined by the LM and the corresponding criterion.
Next we state our Bayesian objective, aiming to optimize the target task prompt using the posterior of source prompts for a transfer learning scheme.
Problem Statement. The objective is to maximize the posterior probability of the target prompt θ^T, as expressed by the following equation:

max_{θ^T} p(θ^T | D^T, D^S) ∝ p(D^T | θ^T) p(θ^T | D^S),  (1)

where p(D^T | θ^T) is the likelihood and p(θ^T | D^S) is the prior learned from the source tasks before target task adaptation:

p(θ^T | D^S) = ∫ p(θ^T | θ^S) p(θ^S | D^S) dθ^S.  (2)

In this context, the prior distribution p(θ^T | D^S) serves as a guide for target task adaptation. We model p(θ^T | θ^S) as a multivariate Gaussian with mean θ^S, since without any information on the target task it is natural to have θ^T = θ^S.
This problem formulation provides a general framework subsuming the conventional transfer learning method for prompt tuning. For example, we could approximate the integral in Eq. (2) defining the prior on θ^T using a single prompt θ^{S*} trained on the source tasks, i.e., p(θ^S | D^S) = δ_{θ^{S*}}(θ^S), which is roughly equivalent to the conventional transfer learning setting where the source prompt serves as the initialization of the target prompt.
Assuming an uninformative prior for the source prompt θ^S (e.g., a uniform distribution) as well as independent selection of source tasks, the posterior distribution p(θ^S | D^S) for the source tasks factorizes as the product of the posteriors of each task.
Remark. Assuming a uniform prior for θ^S and independent selection of source tasks, the global posterior p(θ^S | D^S) is proportional to the product of the per-task posteriors:

p(θ^S | D^S) ∝ ∏_{k=1}^{K} p(θ^S | D^S_k).

Approach
Instead of optimizing individual prompts for each source task in isolation, our method revolves around learning the posterior distribution of source prompts across all source tasks. This approach assigns larger probability mass to prompts capable of addressing a greater number of source tasks, which thereby become more suitable candidate prompts for the target task as well. We implement this idea by using particles to approximate the posterior distribution. The following subsections provide a detailed explanation of this methodology.

Main Strategy
The optimization of the target task prompt is modeled as MAP inference in Eq. (1), using p(θ^T | D^S) as the prior. We approximate this prior with M particles {θ^S_i}_{i=1}^M (each particle corresponds to a soft prompt) drawn from p(· | D^S) using SVGD:

p(θ^T | D^S) ≈ (1/M) Σ_{i=1}^{M} p(θ^T | θ^S_i).  (3)

For task adaptation, i.e., obtaining the prompt for the target task, the objective in Eq. (1) is optimized based on the approximation provided by Eq. (3):

J(θ^T) = -log p(D^T | θ^T) - log [ (1/M) Σ_{i=1}^{M} p(θ^T | θ^S_i) ].  (4)

The pseudo-code of our BMTPT algorithm is shown in Algorithm 1.
For practical purposes, we can upper-bound the second term of the objective J(θ^T) by applying Jensen's inequality:

-log [ (1/M) Σ_{i=1}^{M} p(θ^T | θ^S_i) ] ≤ (1/(2σ²M)) Σ_{i=1}^{M} ||θ^T - θ^S_i||² + C,  (5)

where σ and C are constants arising from the multivariate isotropic Gaussian assumption on p(θ^T | θ^S). Combining Eq. (4) and Eq. (5), the final loss for target adaptation is therefore

L(θ^T) = -log p(D^T | θ^T) + (1/(2σ²M)) Σ_{i=1}^{M} ||θ^T - θ^S_i||² + C.  (6)

This objective suggests that, during target adaptation, we can initialize θ^T with the average value of the optimized particles, θ^T ← θ̄^S.

[Algorithm 1: Bayesian Multi-Task Prompt Tuning. Input: source tasks D^S, target task D^T, and an initialized particle set Θ_0 = {θ_{0,i}}_{i=1}^M. Source posterior learning via SVGD, followed by target adaptation. Output: trained target weight θ^{T*}.]
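Under the isotropic Gaussian assumption, the adaptation loss reduces to the task loss plus a quadratic pull toward each particle. A minimal sketch (with a made-up NLL value and toy arrays standing in for real prompt particles):

```python
import numpy as np

def target_loss(theta_T, nll, particles, sigma=1.0):
    """Eq. (6) up to the constant C: task NLL plus the Jensen upper bound
    on the negative log prior (quadratic distance to each source particle)."""
    reg = sum(((theta_T - p) ** 2).sum() for p in particles)
    return nll + reg / (2 * sigma ** 2 * len(particles))

# Toy "particles" standing in for optimized source prompts of shape (l, d).
particles = [np.full((4, 3), float(i)) for i in range(5)]

# Initialization suggested by the objective: the particle average.
theta_T = np.mean(particles, axis=0)

# For a fixed task loss, the regularizer is minimized exactly at the average.
assert target_loss(theta_T, 0.0, particles) <= target_loss(particles[0], 0.0, particles)
```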

Source Task Sampling
As transfer learning prepares for unknown, arbitrary target tasks, it is usually considered preferable to learn from a variety of source tasks. However, it becomes burdensome to calculate the training losses of all source tasks as the number of source tasks K increases, so it is necessary to alleviate the bottleneck coming from a large number of source tasks. To this end, we use an approximate posterior distribution instead of the true global posterior distribution. Specifically, during each source posterior learning iteration, we uniformly sample κ tasks from the K source tasks (κ < K) without replacement and constitute a batch with the data entries from those κ tasks.

Figure 2: Each particle is prepended to input texts from the source tasks, forming a batch of size M × K. The cross-entropy loss is computed between the batch of model outputs and the correspondingly structured repeated labels. The loss signal is back-propagated and provides the derivative of the log posterior in the SVGD update rule. The fire and snowflake icons denote the trainable and frozen parts, and <bos> signifies the beginning-of-sentence token.
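A minimal sketch of this sampling step (task names and data entries are placeholders):

```python
import random

def sample_source_batch(datasets, kappa, rng):
    """Draw kappa of the K source tasks without replacement, then take one
    example from each sampled task to build this iteration's batch."""
    chosen = rng.sample(sorted(datasets), kappa)   # kappa < K tasks, no replacement
    return [(name, rng.choice(datasets[name])) for name in chosen]

datasets = {f"task{k}": [("x", "y")] * 8 for k in range(12)}  # K = 12 toy tasks
batch = sample_source_batch(datasets, kappa=6, rng=random.Random(0))
assert len(batch) == 6 and len({name for name, _ in batch}) == 6
```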

Composition of θ^T and Multi-Target Task Adaptation
At the start of target adaptation, we compose θ^T as the element-wise product of a full-rank matrix, initialized with θ̄^S, and a low-rank matrix whose elements are all 1, where both matrices are learnable and have shape (l, d). The low-rank matrix is formed as ab^T, where a = 1_l and b = 1_d are trainable all-ones vectors. Importantly, during target adaptation, we adopt a two-speed learning rate scheme for the full-rank and low-rank matrices by setting a higher learning rate for the low-rank matrix (Ponti et al., 2022; Asai et al., 2022; Wang et al., 2023). This facilitates multi-target task adaptation: we employ multiple low-rank matrices, assigning one to each target task, while sharing the full-rank matrix among all target tasks. In doing so, the full-rank matrix captures the knowledge shared across tasks, while the respective low-rank matrices capture task-specific knowledge (Wang et al., 2023). We also apply this scheme to single-target adaptation, as we empirically observed that the two-speed learning rate promotes faster performance convergence.
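A sketch of this composition (shapes only; the two-speed learning-rate handling is omitted). At initialization ab^T is the all-ones matrix, so θ^T starts exactly at the particle average:

```python
import numpy as np

l, d = 100, 768
theta_bar = np.random.default_rng(0).normal(size=(l, d))  # average of SVGD particles

full_rank = theta_bar.copy()   # shared across target tasks, slower learning rate
a = np.ones((l, 1))            # per-task trainable vector, faster learning rate
b = np.ones((d, 1))            # per-task trainable vector, faster learning rate

theta_T = full_rank * (a @ b.T)          # element-wise product with rank-one a b^T
assert np.allclose(theta_T, theta_bar)   # identity at initialization
```

Only a and b (l + d values) are task-specific, which is what keeps the per-task parameter count small when the full-rank matrix is shared across N target tasks.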

Source Task Posterior Learning
Unlike previous transfer learning methods in prompt tuning that individually train prompts for each source task, we approximate the global posterior distribution over the source tasks by employing M particles, where each particle corresponds to one instance of a soft prompt. Each particle is initialized with randomly sampled tokens, following Lester et al. (2021). We pack a batch as depicted in Figure 2: each particle θ^S_i (1 ≤ i ≤ M) is prepended to input texts from the K source tasks, forming a batch of size M × K. Note that since we want to sample from p(θ^S | D^S) ∝ p(D^S | θ^S) p(θ^S) using SVGD, we can substitute the log-posterior log p(·) in the SVGD update rule with log p(D^S | θ^S), as we assume the prior p(θ^S) is uniform. In practice, we use the negative cross-entropy loss of the language model, given the particles as soft prompts, for log p(D^S | θ^S). Ideally, our SVGD update should be based on the full batch, appending every x in D^S to each θ^S_i and measuring the cross-entropy loss w.r.t. every y in D^S, to approximate the global posterior as accurately as possible. Since this is computationally infeasible, we sample a single (x^k, y^k) pair from each D^S_k as a proxy for the true global posterior. Note that we employ a limited number of SVGD particles, usually M ≤ 10, and perform 100K SVGD updates to approximate sampling θ^S from p(· | D^S).
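The batch packing described above can be sketched as follows (toy shapes; in practice each input is an embedded source-task example and the labels are repeated to match):

```python
import numpy as np

M, K, l, n, d = 5, 6, 100, 24, 768
rng = np.random.default_rng(0)
particles = [rng.normal(size=(l, d)) for _ in range(M)]  # SVGD particles (soft prompts)
examples = [rng.normal(size=(n, d)) for _ in range(K)]   # one sampled (x, y) per task

# Every particle is prepended to every sampled task input: an M x K batch.
batch = [np.concatenate([p, x], axis=0) for p in particles for x in examples]
assert len(batch) == M * K and batch[0].shape == (l + n, d)
```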

Target Task Adaptation
With the initialized θ^T, we start target task adaptation. The loss for the adaptation process is Eq. (6), which combines the Maximum Likelihood Estimation (MLE) loss with respect to D^T and the negative average of the log priors.

Efficiency of BMTPT
Recent prompt tuning transfer methods primarily measure efficiency during target adaptation, overlooking the efficiency of the source task training phase, which is helpful for identifying potential bottlenecks. We highlight the efficiency of BMTPT in comparison to the most recent prompt tuning transfer methods, ATTEMPT (Asai et al., 2022) and MPT (Wang et al., 2023). BMTPT proves to be efficient in both the source posterior learning and target adaptation stages when evaluated in terms of computational and space complexity. The additional intricacies that BMTPT introduces over vanilla prompt tuning are the use of SVGD during source posterior learning and the computation of the regularization term derived from the prior during target adaptation (Eq. (6)). In terms of computational complexity, since the SVGD step primarily involves computing RBF kernel values among a limited number of particles, the computational cost is minimal; likewise, the regularization calculation during target adaptation is negligible. In terms of space complexity, BMTPT is also efficient. During source posterior learning, BMTPT maintains only the SVGD particles, so the required memory is occupied by the backbone LM parameters and the particles, which comprise M · l · d trainable parameters; since we employ a small number of particles, their memory consumption is almost negligible. During target adaptation, as we compose each target task prompt from a shared (full-rank) matrix and a task-specific (low-rank) matrix, BMTPT requires (l · d)/N + (l + d) trainable parameters per target task when adapting to N target tasks. This means BMTPT trains only 0.035% of the parameters compared to full fine-tuning. For a detailed analysis, we direct the reader to Appendix C.
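A quick back-of-the-envelope check of these counts (assuming T5-base with roughly 220M parameters, l = 100, d = 768; the backbone size is approximate and N = 8 is an arbitrary example):

```python
l, d, backbone = 100, 768, 220_000_000   # prompt length, embed dim, ~T5-base

single_prompt = l * d                    # 76,800 trainable parameters
print(f"{single_prompt / backbone:.3%}") # ~0.035% of full fine-tuning

# Multi-target adaptation over N tasks: shared full-rank matrix amortized,
# plus a per-task (a, b) pair of l + d parameters.
N = 8
per_task = (l * d) / N + (l + d)
print(per_task)                          # 10468.0 trainable parameters per task
```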

Contrast with Conventional Multi-Task Learning
Both BMTPT and traditional multi-task learning algorithms utilize multi-source data. However, BMTPT uses the multi-source data to learn a posterior distribution across the source tasks and transfers that posterior to the target domain under a Bayesian perspective, whereas traditional multi-task learning methods generally optimize network parameters with respect to MLE objectives.

Distinctive Motivation behind BMTPT
BMTPT focuses on the core of transfer learning by constructing a useful distribution as a starting point for target adaptation. Unlike existing prompt transfer methods such as SPoT, ATTEMPT, and MPT, which depend on the transferability between specific NLP tasks (for instance, SQuAD being more advantageous than SST-2 for solving MRPC), BMTPT is designed to be dataset-agnostic. This allows broader application across various tasks without relying on task-specific transferability. The experimental findings in Section 6 support the efficacy of this motivation.

Datasets and Tasks
As in previous works (Asai et al., 2022; Wang et al., 2023), we use a set of 6 extensive datasets as source tasks and assess the performance of our algorithm on 21 distinct target tasks, encompassing entailment, paraphrase detection, sentiment analysis, question answering (QA), and commonsense reasoning.

Implementation Details and Baselines
Implementation Details. Throughout the experiments, we use T5-base as the backbone LM for BMTPT and all baselines, with a prompt of length 100. Unless specified differently, we employ 5 particles for SVGD and use the 6 source tasks mentioned in Subsection 5.1, forming a batch of size 30 (5 × 6). We use σ = 10^5 for the target adaptation loss in Eq. (6).
For the two-speed learning rate, we set 0.3 as the full-rank matrix learning rate and 0.4 as the low-rank matrix learning rate. We use a batch of size 32 during target adaptation. For multi-target task adaptation, we first form a batch of input texts from the target tasks using the example-proportional mixing strategy (Raffel et al., 2020b), then prepend the corresponding target prompt to each input text in the batch. We ran all experiments three times with different random seeds and report the mean and standard deviation of the results. In cases where a dataset lacks a publicly available test split with annotations, we adopt either the original development set as our test set or split the original development set into separate development and test sets, following Mahabadi et al. (2021).
Also, to compare our algorithm with conventional multi-task transfer learning, we implement and evaluate a vanilla multi-task transfer method that learns a single prompt on the combined loss of the source tasks and transfers it to the target task. We either directly quote reported numbers or run publicly available source code with the same backbone for a fair comparison, as outlined in the respective papers (Mahabadi et al., 2021; Karimi Mahabadi et al., 2021b; Asai et al., 2022; Wang et al., 2023).

Results
In Section 6.1, we provide the main findings on the GLUE and SuperGLUE benchmarks. In Section 6.2, we provide a set of further analyses. For findings on the MRQA and "Others" benchmarks, please refer to Appendix D.

GLUE and SuperGLUE
As shown in the top part of Table 1, BMTPT achieves new state-of-the-art results in parameter-efficient fine-tuning on both GLUE and SuperGLUE, outperforming other prompt tuning transfer methods (Vu et al., 2022; Asai et al., 2022; Wang et al., 2023). Compared to vanilla PT (Lester et al., 2021), BMTPT demonstrates a relative improvement of 16.5% on GLUE and 16.8% on SuperGLUE, highlighting the advantages of transferring knowledge using a Bayesian approach. It is worth mentioning that BMTPT outperforms the full fine-tuning baseline on both benchmarks, despite tuning only 0.035% of the parameters compared to full fine-tuning. The results in the bottom part of Table 1 demonstrate the ability of BMTPT to effectively utilize multi-task knowledge during fine-tuning on a group of target tasks, highlighting that BMTPT can benefit from the multi-target adaptation setting by further reducing the number of trainable parameters.
We also compare the performance of BMTPT and the vanilla multi-task transfer introduced in Section 5.2 in Table 1. Surprisingly, vanilla multi-task transfer shows strong performance on GLUE and SuperGLUE tasks, outperforming competitive baselines. This result supports the claim in Section 2.1 that previous methods (Vu et al., 2022; Asai et al., 2022; Wang et al., 2023) are not the optimal transfer technique. It is worth noting that BMTPT outperforms vanilla multi-task transfer. To understand this advantage, we can consider the Bayesian perspective of BMTPT, which subsumes conventional transfer learning: while vanilla multi-task transfer learns only an initialization point that carries relatively limited source task information (Shwartz-Ziv et al., 2022), BMTPT learns a posterior from the source tasks and adopts it as a prior during target adaptation, enabling a richer and more informed adaptation process.

Few-Shot Experiments
We also present the results of the few-shot experiments on the GLUE and SuperGLUE datasets. For the 4-shot experiments, the learning rates were reduced to one-third of their original values to accommodate the decreased batch size relative to the standard experiments. The performance figures for BMTPT are averaged over three runs, each initialized with a different random seed. These outcomes suggest that the prior used in target adaptation effectively positions the prompts at a good initial point for task adaptation in low-resource conditions.

Model Scaling
We perform scaling experiments to analyze the performance of BMTPT as the size of the pre-trained model increases. The results demonstrate that BMTPT benefits substantially from scaling to larger backbone LMs, in line with the finding of Lester et al. (2021) that prompt tuning is especially effective with larger models. Note that BMTPT achieves performance comparable to fully fine-tuned models even with T5-base, meaning that BMTPT is effective across various model scales.

Effectiveness of Source Task Sampling
To evaluate the effectiveness of Source Task Sampling discussed in Section 4.2, we conducted experiments under two settings: (1) subsampling 3 tasks from the pool of 6 source tasks (see Section 5.1), to examine whether Source Task Sampling can mitigate performance degradation in limited-compute scenarios, and (2) diversifying the source task set to 12 tasks and subsampling 6 tasks from this expanded set, to investigate the potential benefits of Source Task Sampling with a larger source task set. For the second setting, we expand the source task set with AGNews (Zhang et al., 2015), CommonsenseQA (Talmor et al., 2019), OpenBookQA (Mihaylov et al., 2018), ARC (Clark et al., 2018), adversarial NLI (Nie et al., 2020), and Winogrande (Sakaguchi et al., 2020).
From Table 3, we can see that setting (1) exhibits minimal performance degradation compared to the case with all 6 source tasks, indicating the successful application of Source Task Sampling in low-compute scenarios. Setting (2) demonstrates slight performance gains, suggesting that Source Task Sampling can also benefit from a more diverse source task set.

BMTPT Performance on Different Numbers of Particles
Since SVGD is a particle-based VI method, the number of particles employed may affect the performance of our method. We therefore investigate the effect of the number of particles on target adaptation performance by comparing 5-particle and 10-particle BMTPT (Table 3). We found that the 10-particle case does not yield better results than the 5-particle case. Given the instability reported in the original SVGD paper (Liu and Wang, 2016) and a similar empirical finding by Yoon et al. (2018), we speculate that this absence of improvement may be attributed to inherent characteristics of SVGD, including its sensitivity to kernel function parameters.

Effect of Prior
To assess the impact of the prior term in Eq. (6), we conducted an ablation experiment removing the prior term from the target adaptation loss. The ablated version of BMTPT performed worse, implying the efficacy of learning an informative source posterior and leveraging it during target adaptation to facilitate effective transfer learning.

Limitations and Future Direction
While showing compelling experimental results with only the use of a soft prompt, BMTPT has its limitations. A primary issue is the increase in overall input length caused by prepending the soft prompt to the input text, which raises the memory footprint. This challenge is well documented in the prompt tuning literature (Karimi Mahabadi et al., 2021a), and BMTPT encounters it as well.
Additionally, since BMTPT uses multiple particles and appends source task sentences to each, the batch size grows with the number of particles, potentially heightening memory demands during the source posterior learning phase. Mitigation strategies such as source task sampling or reducing the number of particles may alleviate this issue. Experiments determining the optimal number of particles were not performed in our study; future research could explore this aspect to ascertain the most appropriate number of particles.
Furthermore, it is recognized that SVGD may suffer from variance collapse when the number of particles is not sufficiently large relative to the particle dimension. We hypothesize that the samples of θ^S may therefore be predominantly positioned near the peak of the global posterior distribution after source posterior learning.
On a related note, however, averaging the SVGD particles {θ^S_i}_{i∈[M]} can be thought of as averaging models located near the main mode of the distribution we are pursuing. Given that recent studies (Wortsman et al., 2022; Gueta et al., 2023) have illustrated that the most accurate solutions often emerge from the midpoint of fine-tuned models, the averaging scheme in our method may produce an effective midpoint with higher performance. It would therefore be interesting for future research to study methods that effectively find regions of high-performing solutions in weight space (Gueta et al., 2023) and draw a midpoint that works well on various downstream tasks or validation splits, possibly without the Bayesian framework used in this work.

Conclusion
We present Bayesian Multi-Task Prompt Tuning (BMTPT), a Bayesian approach for transferring soft prompts. Our method defines a posterior distribution over prompts on the source tasks, approximates this posterior using SVGD, and then initializes the target prompt with an aggregation of the source prompts while regularizing the training of the target prompt with the transferred posterior. Empirically, we find this approach achieves comparable or superior performance over strong parameter-efficient fine-tuning baselines.
Despite demonstrating superior performance, BMTPT encounters limitations such as increased memory requirements due to extended input lengths and the duplicated batches needed for multiple particles in SVGD. Notably, variance collapse in SVGD makes it harder to estimate the distribution, but it can also improve performance through model averaging. Future research will focus on optimizing the number of particles to reduce memory constraints, investigating the impacts of variance collapse, and developing strategies to harness SVGD more effectively. These initiatives aim to enhance BMTPT's efficiency and broaden its applicability.

A Analogy with Gaussians for the Aggregation of Prompts Trained on Diverse Tasks
Consider a scenario with K source tasks, each characterized by a posterior p(θ | D_k) = N(μ_k, Λ_k^{-1}), where D_k represents the dataset of the k-th task and θ is a soft prompt. Under a uniform prior, maximizing the likelihood (MLE) is equivalent to MAP estimation and would lead each source prompt trained on task k to the mode μ_k. By combining the individual posteriors and assuming independent selection of tasks, we can construct the global posterior p(θ | D ≡ ∪_{k=1}^K D_k) ∝ ∏_{k=1}^K p(θ | D_k). The goal of transfer learning is to maximize this posterior, anticipating that the overall knowledge captured from the source tasks will provide a good starting point for a target task. Note that this posterior, being a product of Gaussian distributions, is itself a Gaussian distribution with mean

μ_global = (Σ_{k=1}^K Λ_k)^{-1} Σ_{k=1}^K Λ_k μ_k.

Since the mean is the mode of a Gaussian, μ_global would be a good candidate for the initialization point of the target prompt. However, unless the covariances differ only by a scaling factor, a weighted sum of the individual modes {μ_k}_{k=1}^K is unlikely to equal μ_global.
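A tiny 1-D numeric example of this mismatch (values are arbitrary): with two Gaussian posteriors of different precisions, the mode of their product is the precision-weighted mean, not the plain average of the individual modes:

```python
import numpy as np

mu = np.array([0.0, 4.0])    # per-task posterior modes mu_k
lam = np.array([1.0, 9.0])   # per-task precisions Lambda_k (inverse variances)

mu_global = (lam * mu).sum() / lam.sum()  # mode of the product Gaussian: 3.6
mu_avg = mu.mean()                        # naive aggregation of modes: 2.0
assert mu_global != mu_avg
```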

B Details for SVGD

B.1 Choice of SVGD
Stein Variational Gradient Descent (SVGD) is a nonparametric variational inference technique that combines the benefits of Markov chain Monte Carlo (MCMC) and variational inference (Liu and Wang, 2016). Our choice of SVGD over conventional variational inference (VI) methods is driven by several factors, each rooted in the limitations and attributes of standard VI approaches.
The target posterior distribution we aim to approximate is complex and potentially multi-modal. Standard VI methods, constrained to a specific family of distributions, often fail to capture such intricate structure and can therefore exhibit an inherent bias toward particular tasks. In contrast, SVGD employs a particle-based approach that dynamically generates a more expressive class of approximating distributions, allowing it to represent complex and multi-modal distributions with greater accuracy.
Furthermore, traditional VI methods such as variational autoencoders (VAEs) are generator-based and necessitate sampling, whereas SVGD only requires the log-derivative of the target density at each point, commonly referred to as the score function. Additionally, while most VI methods minimize surrogates of the KL divergence through optimization, SVGD employs a first-order update with a competing mechanism between particle repulsion and gradient descent.

B.2 Mathematical Explanation
Whereas gradient descent moves each particle in the direction of fastest objective decrease, SVGD identifies the optimal transformation that minimizes the KL divergence between the current distribution and the target distribution.
To find the optimal direction in the unit ball $\mathcal{B}$ of the reproducing kernel Hilbert space $\mathcal{H}$, the closed linear span of $\{k(\theta, \cdot) : \theta \in \mathbb{R}^D\}$, that most rapidly decreases the KL divergence toward the target distribution $p$, SVGD uses the point transformation $T_{[\alpha\phi]}(\theta) = (I + \alpha\phi)(\theta)$. We use the same notation for a probability density and its probability measure $\mu$ when no confusion arises. Specifically, SVGD finds the $\phi^*$ that satisfies
$$\phi^* = \arg\max_{\phi \in \mathcal{B}} \left\{ -\frac{d}{d\alpha} \mathrm{KL}\!\left(T_{[\alpha\phi]}\#\mu \,\Big\|\, p\right) \Big|_{\alpha=0} \right\},$$
where $T\#\mu(A) = \mu(T^{-1}(A))$ denotes the pushforward measure. The closed-form solution is
$$\phi^*(\cdot) \propto \mathbb{E}_{\theta \sim \mu}\!\left[k(\theta, \cdot)\,\nabla_\theta \log p(\theta) + \nabla_\theta k(\theta, \cdot)\right],$$
where $\log p(\theta)$ is the log-likelihood of $p$. The SVGD algorithm then updates the distribution as
$$\mu \leftarrow T_{[\alpha\phi^*]}\#\mu,$$
where $\alpha$ is the step size. In the discretized version for a finite set of particles $\{\theta_i\}_{i=1}^{M}$, SVGD iteratively transports the particles using the following update rule for $i, j \in [M]$:
$$\theta_i \leftarrow \theta_i + \frac{\alpha}{M} \sum_{j=1}^{M} \left[k(\theta_j, \theta_i)\,\nabla_{\theta_j}\log p(\theta_j) + \nabla_{\theta_j}k(\theta_j, \theta_i)\right].$$
The behavior of SVGD is orchestrated by the two terms in this update, which define its key control mechanisms. The first term shares gradient information among particles, guiding their update trajectories, with the influence of neighboring particles modulated by kernel-distance weighting. The second term, $\nabla_{\theta_j}k(\theta_j, \theta_i)$, introduces a repulsive force between the particles, preventing them from converging to a single mode.
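A minimal NumPy sketch of this discretized update (our own illustration, not the paper's implementation) makes the two terms explicit. The function and parameter names are ours, and the usage example targets a standard normal, whose score is $\nabla_\theta \log p(\theta) = -\theta$:

```python
import numpy as np

def svgd_step(X, grad_logp, h=1.0, alpha=0.1):
    """One SVGD update for particles X of shape (M, D)."""
    M = X.shape[0]
    # Pairwise RBF kernel k(theta_j, theta_i) = exp(-||theta_j - theta_i||^2 / h).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / h)
    # Driving term: kernel-weighted sharing of score information among particles.
    drive = K @ grad_logp(X)
    # Repulsive term: sum_j grad_{theta_j} k(theta_j, theta_i)
    #   = (2/h) * (rowsum(K)_i * theta_i - (K @ X)_i)   for the RBF kernel.
    repulse = (2.0 / h) * (K.sum(axis=1, keepdims=True) * X - K @ X)
    return X + alpha * (drive + repulse) / M

# Usage: transport widely spread particles toward a standard normal target.
rng = np.random.default_rng(0)
particles = rng.normal(size=(20, 2)) * 3.0
for _ in range(300):
    particles = svgd_step(particles, lambda Z: -Z, h=1.0, alpha=0.05)
```

Dropping the `repulse` term reduces the update to kernel-smoothed gradient ascent on $\log p$, under which all particles would eventually collapse to a single mode.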

B.3 Detailed Explanation for RBF Kernel
In the execution of Stein Variational Gradient Descent (SVGD) for our set of particles $\{\theta_i\}_{i=1}^{M}$, we adopt the radial basis function (RBF) kernel, defined as
$$k(\theta_i, \theta_j) = \exp\!\left(-\frac{\|\theta_i - \theta_j\|_2^2}{h}\right), \qquad h = \frac{\mathrm{med}^2}{2\log(M+1)}.$$
In this formulation, $h$ is a bandwidth parameter that is typically adjusted according to the distances between the particles. As part of our methodology, we follow the median heuristic, a strategy supported by previous studies (Schölkopf and Smola, 2018; Ba et al., 2022): $\mathrm{med}$ is the median of the set of mutual distances between particles.
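The median-heuristic bandwidth can be sketched in a few lines of NumPy. This is an illustrative implementation under the assumption that the normalizing constant is $2\log(M+1)$, as in the expression above; other works use variants such as $\mathrm{med}^2/\log M$:

```python
import numpy as np

def median_bandwidth(X):
    """Median-heuristic bandwidth h = med^2 / (2 log(M + 1)) for particles X of shape (M, D)."""
    M = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Mutual (pairwise) distances, excluding self-distances on the diagonal.
    dists = np.sqrt(sq[np.triu_indices(M, k=1)])
    med = np.median(dists)
    return med ** 2 / (2.0 * np.log(M + 1))

# Four 1-D particles; pairwise distances {1, 1, 2, 2, 3, 4}, median 2.
h = median_bandwidth(np.array([[0.0], [1.0], [2.0], [4.0]]))
```

Recomputing $h$ at every iteration keeps the kernel scale matched to the current particle spread as the particles contract toward the posterior.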

B.4 Damped SVGD
The variant of Stein Variational Gradient Descent (SVGD) we employ in this work is Damped SVGD, as delineated by Ba et al. (2022). In its typical implementation, SVGD is prone to variance collapse when applied in a finite-particle regime rather than updating distributions directly. This necessitates an adaptation of the SVGD update rule to ensure a proper approximation of the distribution with particles. Damped SVGD addresses this issue by moderating the influence of each particle's own gradient-descent term. Compared to the standard update rule for $\theta_i$ in a configuration $\{\theta_i\}_{i=1}^{M}$, the modification scales the driving term by a damping factor $\lambda$:
$$\theta_i \leftarrow \theta_i + \frac{\alpha}{M} \sum_{j=1}^{M} \left[\lambda\, k(\theta_j, \theta_i)\,\nabla_{\theta_j}\log p(\theta_j) + \nabla_{\theta_j}k(\theta_j, \theta_i)\right].$$
In the Damped SVGD paper, $\lambda$ can be chosen using one of two strategies: taking $\lambda = \lambda_{\min} = \min\!\left(1, e^{-1}\left(1 + \frac{M}{l \cdot d}\right)\right)$ for "fully damped," or taking $\lambda$ between $\lambda_{\min}$ and $1$ for "intermediate." In our experiments, we use "intermediate" by consistently choosing the value $\min\!\left(1, e^{-1}\left(5 + \frac{M}{l \cdot d}\right)\right)$, taking both selections into account. In our standard setting, this yields $\lambda \approx 0.368$. This variant of SVGD improves upon the original by mitigating the issue of variance collapse to some degree.
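A hedged sketch of the damped update follows, again as our own illustration rather than the paper's code: it simply scales the driving (gradient-sharing) term by $\lambda$ while leaving the repulsive term untouched, matching the description above; the exact placement of $\lambda$ in Ba et al. (2022) may differ in detail.

```python
import numpy as np

def damped_svgd_step(X, grad_logp, h, alpha, lam):
    """One Damped SVGD update: the driving term is scaled by lam in (0, 1]."""
    M = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / h)
    drive = lam * (K @ grad_logp(X))   # damped gradient-sharing term
    repulse = (2.0 / h) * (K.sum(axis=1, keepdims=True) * X - K @ X)
    return X + alpha * (drive + repulse) / M

# With lam = 1 this reduces to the standard SVGD step; lam < 1 weakens
# the pull toward the mode relative to the repulsion between particles.
X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
score = lambda Z: -Z  # standard normal target
X_damped = damped_svgd_step(X, score, h=1.0, alpha=0.1, lam=0.368)
X_plain = damped_svgd_step(X, score, h=1.0, alpha=0.1, lam=1.0)
```

Because only the attractive term is damped, the relative strength of the repulsion grows, which is what counteracts the variance collapse.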

C Computational Complexity Analysis of BMTPT
Our computational analysis verifies that BMTPT is computationally efficient in both the source-task posterior learning and task adaptation stages. Note that the additional computation required by BMTPT occurs after the prompt receives back-propagated gradient information from the LLM.

C.1 Definitions of Notations
• $M$: number of particles.
• $l$: length of the prompt.
• $d$: hidden dimension of the backbone LLM (large language model).
• $d_{\text{prompt}} = d \times l$: dimension of the prompt.
• $T_{\text{grad}}$: number of operations for gradient backpropagation through the backbone LLM.
• $\Theta$: matrix of prompt parameters with dimensions $M \times d_{\text{prompt}}$.
• $\nabla \log p$: gradient of the log-probability for each particle.

C.2 Source Task Training
In the source-task training phase, we have a multi-particle formulation governed by SVGD with an RBF kernel, involving various matrix and vector products over $\Theta$ and $\nabla \log p$. The computational complexity of BMTPT during this phase can be summarized as $O(T_{\text{grad}}) + M^2 \cdot O(d_{\text{prompt}})$; that is, BMTPT requires an additional $M^2 \cdot O(d_{\text{prompt}})$ operations over vanilla prompt tuning. However, since $T_{\text{grad}}$ is the dominating factor and $M^2 = 25$ in our experiments, this increase is computationally acceptable. The average wall-clock time recorded during source-task training, measured over 5 updates, is 0.42 seconds for the backward pass through the language model (LM) and 0.0035 seconds for Damped SVGD, on a single GeForce RTX 3090 GPU.
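To make the scale of the overhead term concrete, the following back-of-the-envelope count uses $M = 5$ (so $M^2 = 25$, as in the experiments) together with purely hypothetical values of $l$ and $d$ that are not given in this appendix:

```python
# Back-of-the-envelope count of the extra SVGD work per update.
# l and d below are assumed placeholder values, not taken from the paper.
M = 5                # number of particles; M**2 = 25 as in the experiments
l, d = 100, 768      # hypothetical prompt length and LLM hidden dimension
d_prompt = l * d

extra_svgd_ops = M**2 * d_prompt   # the M^2 * O(d_prompt) overhead term
print(f"extra SVGD operations per update ~ {extra_svgd_ops:,}")
```

Even under these assumptions the overhead is on the order of a few million scalar operations, negligible next to a backward pass through a PLM, consistent with the wall-clock measurements above.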

C.3 Task Adaptation Stage
During the task adaptation stage, the additional computational complexity of BMTPT is mainly due to the upper bound of the log-prior term, which takes $O(d_{\text{prompt}})$ computations. Therefore, the computational complexity of a single update is $O(T_{\text{grad}}) + O(d_{\text{prompt}})$; here as well, $T_{\text{grad}}$ is the dominating factor. Similarly, we report the wall-clock time observed on our device during the target adaptation phase, specifically for the SuperGLUE CB task with a batch size of 32: the forward pass through the LM takes an average of 0.16 seconds, while the forward pass for the prior term requires 0.00011 seconds, on a single GeForce RTX 3090 GPU.
D Experiment on MRQA and "Others" Benchmark

Figure 2: This figure illustrates the source-task posterior learning in BMTPT, detailed in Sections 4.1 and 4.3.1. For every SVGD update, we first sample a pair $(x_k, y_k)$ from each source dataset $D_{S_k}$ ($k \in [K]$), then append $\{x_k\}_{k=1}^{K}$ to each SVGD particle $\theta_{S_i}$ ($i \in [M]$), thereby forming a batch of size $M \times K$. The cross-entropy loss is computed between the batch of model outputs and the correspondingly structured repeated labels. The loss signal is back-propagated and provides the derivative of the log posterior in the SVGD update rule. The fire and snowflake icons denote the trainable and frozen parts, respectively, and <bos> signifies the beginning-of-sentence token.

Table 1: Experiment results for GLUE and SuperGLUE using T5-base, along with the number of trained parameters. BMTPT results are averaged across three runs, with subscripts indicating the standard deviation. The evaluation metrics are Pearson correlation for STS-B, F1 for MultiRC, and accuracy for the other tasks. The top rows use single-task adaptation with no parameter sharing during target-task adaptation, while the bottom rows employ multi-task adaptation. The best performance among parameter-efficient fine-tuning methods is bolded. BMTPT consistently outperforms most baselines on GLUE and is comparable on SuperGLUE, affirming its robustness across language tasks.

Table 3: Table corresponding to Section 6.2. We examine BMTPT with larger models and evaluate three components of BMTPT: source-task sampling, performance as a function of the number of particles, and the prior.